Methods for the survey and genetic analysis of populations

ABSTRACT

The present invention relates to methods for performing surveys of the genetic diversity of a population. The invention also relates to methods for performing genetic analyses of a population. The invention further relates to methods for the creation of databases comprising the survey information and the databases created by these methods. The invention also relates to methods for analyzing the information to correlate the presence of nucleic acid markers with desired parameters in a sample. These methods have application in the fields of geochemical exploration, agriculture, bioremediation, environmental analysis, clinical microbiology, forensic science and medicine.

This application is a continuation application of U.S. patent application Ser. No. 13/908,800, filed Jun. 3, 2013 (pending), which is a continuation application of U.S. patent application Ser. No. 13/019,045, filed Feb. 1, 2011 (now U.S. Pat. No. 8,476,016), which is a continuation application of U.S. patent application Ser. No. 10/607,077, filed Jun. 25, 2003 (now U.S. Pat. No. 8,071,295), which is a divisional application of U.S. patent application Ser. No. 09/829,855, filed Apr. 10, 2001 (now U.S. Pat. No. 6,613,520), which claims the benefit of U.S. provisional application 60/196,063, filed Apr. 10, 2000 (now expired) and U.S. provisional application 60/196,258, filed Apr. 11, 2000 (now expired), the entire disclosures of each of which applications are incorporated by reference herein.

This invention was made with Government support under Contract No. DE-FC36-01GO11016 awarded by the Department of Energy. The Government has certain rights in this invention.

TECHNICAL FIELD OF THE INVENTION

The present invention relates to methods for performing surveys of the genetic diversity of a population. The invention also relates to methods for performing genetic analyses of a population. The invention further relates to methods for the creation of databases comprising the survey information and the databases created by these methods. The invention also relates to methods for analyzing the information to correlate the presence of nucleic acid markers with desired parameters in a sample. These methods have application in the fields of geochemical exploration, agriculture, bioremediation, environmental analysis, clinical microbiology, forensic science and medicine.

BACKGROUND OF THE INVENTION

Microbes have been used previously as biosensors to identify chemicals in the environment. For instance, microbes have been utilized as biosensors for the presence of nitrates (Larsen, L. H. et al., 1997, A microscale NO₃⁻ biosensor for environmental applications. Anal. Chem. 69:3527-3531), metals (Virta M. et al., 1998, Bioluminescence-based metal detectors. Methods Mol. Biol. 102:219-229), and a variety of hydrocarbons (Sticher P. et al., 1997, Development and characterization of a whole-cell bioluminescent sensor for bioavailable middle-chain alkanes in contaminated groundwater samples. Appl. Environ. Microbiol. 63(10):4053-4060). In these examples, however, the indicator microbes are not native species, but rather, the product of recombinant manipulations designed for specific applications. These modifications involve coupling the nutrient sensing machinery of well-characterized bacterial strains with reporter genes to aid identification. This approach is limited, however, by the metabolic diversity of a few well-characterized bacterial strains. In contrast, the large and diverse pool of microbes in the environment represents a source of biosensors for a much larger range of applications than currently exists. Thus, there is a need to identify and use other microbes, especially those found in situ, as biosensors.

Microbes also have an important impact on health and medicine. It has been estimated that there may be ten times as many microbial cells associated with the human body as there are human cells. Many microbial cell populations that are associated with the human body play a beneficial role in maintaining health. For instance, gut microflora is important for proper digestion and absorption of nutrients and for production of certain factors, including some vitamins. In general, the human immune system is able to keep the bacterial populations of the human body in check and prevent the overgrowth of beneficial microbial populations and infection by detrimental microbial populations. Nevertheless, the list of human diseases that are now attributed to microbial pathogens is growing. However, nearly all of the information regarding the relationships between microbes and human disease has been gained from approaches that require culture of microbial species.

Two examples of diseases where the causative agents were identified through molecular methods include bacillary angiomatosis (Relman, D. A. et al., 1990, New Engl. J. Med. 323: 1573) and Whipple's disease (Wilson, K. H. et al., 1991, Lancet 338: 474). Further, the central aspects of atherosclerosis are consistent with the inflammation that results from infection. DNA sequences from Chlamydia have been identified in atherosclerotic lesions, which has led to suggestions that this organism plays a role in the disease.

In addition, bacterial infections have become an increasing health problem because of the advent of antibiotic-resistant strains of bacteria. Further, microbial infections caused by bacteria or fungi that do not usually infect humans may be a problem in immunocompromised individuals. Further, individuals in developing countries who may be malnourished or lack adequate sanitary facilities may also support a large load of opportunistic bacteria, many of which may cause sickness and disease. In veterinary medicine, livestock living in close quarters also may be prey to infections caused by a variety of different types of microbes. Thus, there is a need to develop sensitive methods of identifying many different types of microbes without having to cultivate them first in order to treat or prevent microbial infections in humans and other animals.

Assays for microbial contamination are an important component of food testing as well. A large number of different types of microbes may contaminate food for humans or animals. Thus, an ability to test food for contamination quickly and effectively is critical for maintaining food safety. However, many of the microbes responsible for causing sickness in humans and animals are difficult to isolate or identify.

Assays for microbial populations also have use in fields such as forensic science. Over the past ten to twenty years, scientists have determined that microbial populations change when bodies begin to decay, and have begun to identify certain microbial species that are indicative of decomposition (Lawrence Osborne, Crime-Scene Forensics; Dead Men Talking, New York Times, Dec. 3, 2000). However, only a few microbial species that may be useful in these analyses have been identified.

The problem of determining genetic diversity is not confined to microbial populations. Antibody diversity is critical for a proper immune response. During B cell differentiation, antibody diversity is generated in the heavy and light chains of the immunoglobulin by mechanisms including multiple germ line variable (V) genes, recombination of V gene segments with joining (J) gene segments (V-J recombination) and recombination of V gene segments with D gene segments and J gene segments (V-D-J recombination) as well as recombinational inaccuracies. Furthermore, somatic point mutations that occur during the lifetime of the individual also lead to antibody diversity. Thus, a huge number of different antibody genes coding for antibodies with exquisite specificity can be generated. T cell receptor (TCR) diversity is generated in a similar fashion through recombination between numerous V, D and J segments and recombinational inaccuracies. It has been estimated that 10¹⁴ Vδ chains, more than 10¹³ β chains and more than 10¹² forms of Vα chains can be made (Roitt, I. et al., Immunology, 3rd Ed., 1993, pages 5.1-5.14). Knowledge of the antibody or TCR diversity in a particular individual would be useful for diagnosis of disease, such as autoimmune disease, or for potential treatment.

The identification of microbes, especially soil microbes, has traditionally relied upon culture-dependent methods, whereby the detection of a microbial species depends upon the ability to find laboratory conditions that support its growth. To this end, 96-well plates have been commercially developed to identify microbes with different metabolic requirements. For instance, BioLog plates incorporate 96 different media formulations into the wells of a 96-well plate. Despite these efforts, it is now accepted that far fewer than 1% of microbes can propagate under laboratory conditions (Amann, R. I. et al., 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol. Rev. 59:143-169).

The widespread interest in genomics has created many exciting new technologies for the parallel quantitation of thousands of distinct nucleic acid sequences simultaneously. While still in their infancy, these technologies have provided unprecedented insight into biology. To date, these technologies have predominantly been utilized in pharmaceutical and agricultural applications. Genome expression profiling has gained general acceptance in biology and is likely to become commonplace in all academic, biotechnology and pharmaceutical institutions in the 21st century. For instance, Serial Analysis of Gene Expression (SAGE) is a hybridization-independent method designed to quantitate changes in gene expression (Velculescu, V. E. et al., 1995, Serial analysis of gene expression. Science 270:484-487 and U.S. Pat. No. 5,866,330). However, SAGE only measures RNA levels from tissues or organisms, and is not suitable for examining genetic diversity.

The widespread interest in genomics has also led to the development of many technologies for the rapid analysis of tens of thousands of nucleic acid sequences. One such technology is the DNA chip. Although this approach has been used as a diagnostic for distinguishing between several species of the genus Mycobacterium (Troesch, A., et al., 1999, Mycobacterium species identification and rifampin resistance testing with high-density DNA probe arrays. J. Clin. Microbiol. 37:49-55), it has limited utility for an environmental microbial survey for two reasons. First, the sequence of the target DNAs to be analyzed must be known in order to synthesize the complementary probes on the chip. However, the vast majority of environmental microbes have not been characterized. Second, DNA chips rely on hybridization of nucleic acids, which is subject to cross-hybridization from DNA molecules with similar sequence. The resolving power of a hybridization-based approach is therefore limited because one must identify regions of DNA that do not cross-hybridize, which may be difficult for related microbial species.

Genomic technologies and bioinformatics hold much untapped potential for application in other areas of biology, especially in the field of microbiology. However, to date there has not been a method to rapidly and easily determine the genomic diversity of a population, such as a microbial or viral population. Further, there has not been a method to easily determine the antibody or TCR diversity of a population of B or T cells, respectively. Thus, there remains a need to develop such methods in these areas.

BRIEF SUMMARY OF THE INVENTION

The present invention solves this problem by providing methods for rapidly determining the diversity of a microbial or viral population and for determining the antibody or TCR diversity of a population of B or T cells. The present invention relies on hybridization-independent genomic technology to quickly “capture” a portion of a designated polymorphic region from a given DNA molecule present in a population of organisms or cells. This portion of the DNA molecule, a “marker,” is characteristic of a particular genome in the population of interest. The marker can be easily manipulated by standard molecular biological techniques and sequenced. The sequence of a multitude of markers provides a measure of the diversity and/or identity of a population. In one aspect, the invention provides a method, Serial Analysis of Ribosomal DNA (SARD), that can be used to distinguish different members of a microbial population of interest.

In another aspect, the invention provides a method for analyzing a designated polymorphic region from a population of related viruses using method steps similar to those described for SARD. In a further aspect, the invention provides a method for analyzing the variable regions from the immunoglobulin or TCR genes of a population of immune cells using method steps similar to those described for SARD.

In another aspect of the invention, a method is provided for analyzing a population based upon an array of the masses of peptides that are encoded by polymorphic sequences of particular DNA molecules in a region of interest. In a preferred embodiment, the region of interest is a designated polymorphic region from an rDNA gene from each member of a microbial population.

In another aspect of the invention, a method is provided for analyzing the information provided by the above-described methods. The method enables the creation of a diversity profile for a given population. A collection of diversity profiles provides an accurate representation of the members present in a population. These diversity profiles can be entered into a database along with other information about the population. The diversity profiles can be used with various correlation analyses to identify individuals, or sets of individuals, that correlate with each other. The correlation analyses can be used for diagnostic or other purposes. In another aspect, the invention provides databases comprising various diversity profiles. In a preferred embodiment, the diversity profile is obtained by SARD.

In yet another aspect of the invention, a method is provided for identifying a diversity profile, as described above, that correlates with a parameter of interest. In a preferred embodiment, the diversity profile is a profile of the microbial populations that correlate with the presence of mineral deposits and/or petroleum reserves. In another preferred embodiment, the diversity profile is a profile of populations of different antibodies or TCR that correlate with a specific disease state, such as an autoimmune disorder.

In a still further aspect, the invention provides a method for locating mineral deposits or petroleum reserves comprising identifying one or more nucleic acid markers that correlate with the presence of mineral deposits or petroleum reserves, isolating nucleic acid molecules from an environmental sample, determining whether the nucleic acid markers are present in the environmental sample, wherein if the nucleic acid markers are present, then the area from which the environmental sample was obtained is likely to have mineral deposits or petroleum reserves.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a representation of a 16S rDNA gene from Bacteria illustrating polymorphic regions (shown as dark bands) and constant regions (shown as light bands).

FIG. 2 shows a schematic representation of the SARD method for isolating polymorphic sequence tags from Bacterial rDNA.

FIG. 3 shows a number of representative members of the domain Bacteria with their taxonomic relationships.

FIG. 4 shows a number of representative members of the domain Archaea with their taxonomic relationships.

FIGS. 5A and 5B show examples of Marker-Marker correlation scatter plots. Each point represents a single sample population. FIG. 5A shows a positive correlation and FIG. 5B shows zero correlation.

FIGS. 6A-6F show examples of various Marker-Parameter scatter plots. Each point represents a single sample population. FIG. 6A shows zero correlation; FIG. 6B shows a curvilinear relation; FIG. 6C shows a positive linear correlation; FIG. 6D shows a negative linear correlation; FIG. 6E shows a nonlinear relation; and FIG. 6F shows a nonlinear relation.

FIGS. 7A, 7B and 7C show scatter plots comparing Marker Diversity Profiles (MDPs). Each point represents one marker. FIG. 7A: Significant correlation with all markers within the MDP. FIG. 7B: No correlation found using all the markers within the MDP. FIG. 7C: Same plot as FIG. 7B except that significant correlation is found using a subset of markers.

FIG. 8 shows a schematic for generating a marker diversity profile matrix database. The first step 100 involves assigning N an integer value of 1 corresponding to the first MDP data structure. The second step 105 involves assembling polymorphic markers from a sample. Examples of methods for assembling such markers are described in this application and include, but are not limited to, SARD (FIG. 2, Tables I and II) or mass tag compilation (Table III). The next step 110 involves detecting the polymorphic markers. Examples of methods to detect polymorphic markers described above include DNA sequence analysis of SARD tags and MALDI-TOF analysis of mass tags, respectively. This detection step includes detecting the presence and abundance of each marker in the sample. The next step 115 involves the conversion or transduction of the MDP into an electrical signal output. Generally, this process is a linear electronic conversion of the data into a digital signal. Step 120 involves detecting parameters that are associated with the sample N. Step 125 involves transducing the sample parameter data, which may include, without limitation, such parameters as pH, grain size, elemental analysis and/or organic analysis, into an electrical signal output in the form of a digital signal. Step 130 involves storing into the memory of a computer the output signal from each MDP into a matrix data structure and associating it with sample parameters. The next step is a decision block 135 where if all the data structures have not been completed the routine advances to step 140 where n is incremented and the data structure generating steps 105-130 are repeated for the (n+1)th marker diversity profile. Once all the data structures are completed the routine proceeds to step 145 to form the MDP database from n data structures. Each marker is assigned a unique identifier along with its relative abundance in the population. This information is also optionally indexed with other known parameters that are associated with the sample including, for instance, the time, date, elevation and geographical location. These signals are digitized and stored in the memory of a computer.
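The loop of FIG. 8 can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions and is not the implementation described in this application: the function names build_mdp and build_database, the dictionary layout, the tag strings and the parameter values are all hypothetical, and relative abundance is assumed to be a marker's count divided by the total tag count for the sample.

```python
from collections import Counter


def build_mdp(tags, parameters):
    """Build one marker diversity profile (MDP) from the tags of one sample.

    Hypothetical layout: each distinct tag is its own identifier, mapped to
    its relative abundance (count / total tags), stored with the parameters.
    """
    counts = Counter(tags)
    total = sum(counts.values())
    abundance = {marker: n / total for marker, n in counts.items()}
    return {"markers": abundance, "parameters": dict(parameters)}


def build_database(samples):
    """Repeat the profile-building step for every sample (steps 105-140)."""
    database = []
    for tags, parameters in samples:
        database.append(build_mdp(tags, parameters))
    return database


# Made-up example input: tag lists and sample parameters are placeholders.
samples = [
    (["TAG_A", "TAG_A", "TAG_B", "TAG_C"], {"pH": 6.8, "site": "hypothetical-1"}),
    (["TAG_B", "TAG_B"], {"pH": 7.4, "site": "hypothetical-2"}),
]
mdp_database = build_database(samples)
```

In this sketch a plain list of dictionaries stands in for the matrix data structure; a relational table keyed on sample and marker would serve equally well.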

FIG. 9 shows a schematic of the steps involved in determining parameters associated with an MDP. A marker diversity profile 200 is created for a sample. The marker diversity profile is subject to a comparison function 205 which compares the profile with resident marker diversity profiles in the database. Step 210 is a decision block where a query is made whether the new marker diversity profile equals a resident marker diversity profile in the database. If the query returns a “Yes” the new marker diversity profile is deduced 215 to share the same parameters as the resident marker diversity profile. Since the parameters associated with the resident marker diversity profile are characterized, the parameters associated with the new marker diversity profile are identified.

If the query in decision block 210 returns a “No” the routine proceeds to decision block 220 which queries whether the new marker diversity profile is a subset of a resident profile in the database. If the query returns a “No” the parameters remain undefined 225. If the query returns a “Yes” the routine proceeds to step 230. Step 230 is an optional step to determine the correlation between members of the common subset of markers and may either be performed for each new profile or may be queried from a matrix table of pre-calculated values from existing profiles. Such values generally would be maintained in a relational database. If this step is not performed all common markers are parsed into groups of individual markers and treated as correlated groups 255. If the marker-marker correlation is performed between the common subset of markers, the routine proceeds to decision block 235 which queries whether all of the common markers are correlated. If the query returns a “Yes” the markers are correlated with the parameters 240 resident in the database. If none of the markers are correlated with a parameter, the parameter(s) remain undefined 245 whereas if the markers are correlated with a parameter, the parameter is deduced to be associated with the marker diversity profile 250. If the decision block query 235 returns a “No”, the common markers are sorted into groups of correlated markers 255. The first correlated marker group N 260 is subject to a decision block 265 that queries whether the markers in this group are correlated with a parameter. A “No” determines that the parameters remain undefined. If the marker(s) are correlated with a parameter, the parameter is deduced to be associated with the marker diversity profile. Steps 260-275 are repeated in steps 280-295 for each correlated group of markers. The groups of correlated markers may be comprised of a single or multiple markers. The confidence level in deducing that a parameter is associated with a marker diversity profile is determined by the level of correlation between the marker(s) and the parameter. Therefore, sets of correlated markers are expected to be more robust indicators of any given parameter.
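The first two decision blocks of FIG. 9 (210 and 220) can be sketched in Python as follows. This is a simplified illustration rather than the routine itself: the function name compare_profile and the database layout are hypothetical, two profiles are assumed to be "equal" when they contain the same set of markers regardless of abundance, and the later marker-marker and marker-parameter correlation steps (230-295) are deliberately left out.

```python
def compare_profile(new_markers, database):
    """Classify a new profile against resident profiles (blocks 210 and 220).

    Returns ("equal", index), ("subset", index) or ("undefined", None).
    Assumption: profiles are compared as sets of marker identifiers only.
    """
    new_set = set(new_markers)
    # Decision block 210: does the new profile equal a resident profile?
    for index, resident in enumerate(database):
        if new_set == set(resident["markers"]):
            return "equal", index
    # Decision block 220: is the new profile a subset of a resident profile?
    for index, resident in enumerate(database):
        if new_set and new_set < set(resident["markers"]):
            return "subset", index
    # Neither test succeeded, so the parameters remain undefined (225).
    return "undefined", None


# Made-up resident database; marker names and parameters are placeholders.
resident_database = [
    {"markers": {"A": 0.5, "B": 0.3, "C": 0.2}, "parameters": {"pH": 6.8}},
    {"markers": {"D": 1.0}, "parameters": {"pH": 7.4}},
]

status, match = compare_profile(["A", "B"], resident_database)
```

When the status is "equal", the parameters of the matching resident profile can be read from the database entry at the returned index; a "subset" result would be passed on to the correlation steps described above.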

FIG. 10 shows a schematic for generating a marker diversity profile matrix database. The first step 300 involves assigning N an integer value of 1 corresponding to the first MDP data structure. The second step 305 involves assembling polymorphic markers from a sample. Examples of methods for assembling such markers are described in this application and include, but are not limited to, SARD (FIG. 2, Tables I and II) or mass tag compilation (Table III). The next step 310 involves detecting the polymorphic markers. Examples of methods to detect polymorphic markers described above include DNA sequence analysis of SARD tags and MALDI-TOF analysis of mass tags, respectively. The next step 315 involves the conversion or transduction of the MDP into an electrical signal output. Generally, this process is a linear electronic conversion of the data into a digital signal. Step 320 involves storing into the memory of a computer the output signal from each MDP into a matrix data structure associating each MDP with geographic coordinates such as longitude and latitude. The next step is a decision block 325 where if all the data structures have not been completed the routine advances to step 330 where n is incremented and the data structure generating steps 305-320 are repeated for the (n+1)th marker diversity profile. Once all the data structures are completed the routine proceeds to step 335 to form the MDP database from n data structures. Each marker is assigned a unique identifier and indexed with its relative abundance in the population. These signals are digitized and stored in the memory of a computer.

FIG. 11 shows a schematic for mapping applications using marker diversity profiles. Marker diversity profile data 400 can be processed in several ways to create maps that provide significant environmental information. In one example 405, each marker diversity profile in a database can be correlated with every other marker diversity profile in a pairwise manner to create a correlation matrix. By appending this data to the geographical coordinates of each sample 410, a map can be constructed that reflects the correlation values of physically neighboring sample sites. Preferably the correlation values will be color coded to reflect the level of correlation. The color is chosen from a reference color spectrum that is indexed to correlation values between 0-1.

Marker diversity profiles 400 can also be processed into maps at the individual marker or correlated marker group level. This approach is preferable since subsets of markers are likely to correlate to a smaller number of sample-associated parameters. Each marker in a marker diversity profile database is correlated with every other marker in the database in a pairwise manner to create a correlation matrix 415. The source database can either be composed of marker diversity profiles from a single geographic area or several distinct areas. In step 420, the markers from one geographic area are sorted into groups based upon their level of correlation. In step 425, the relative representation of the correlated marker group N is determined along with its geographical coordinates for each marker diversity profile in a geographic area. A map is constructed 430 where the relative abundance of each correlated marker group is color-coded with its geographical coordinates. Steps 425 and 430 are repeated as in 435 and 440 for each correlated group of markers.
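The pairwise profile correlation of step 405 can be illustrated with the following Python sketch. The text does not name a correlation statistic, so the Pearson coefficient is used here purely as an assumption; the function names pearson and correlation_matrix and the example profiles are likewise hypothetical. Markers absent from a profile are assumed to have an abundance of zero.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """Correlate every profile with every other profile in a pairwise manner."""
    # Align all profiles on the union of their markers (missing marker = 0).
    markers = sorted(set().union(*(p.keys() for p in profiles)))
    vectors = [[p.get(m, 0.0) for m in markers] for p in profiles]
    return [[pearson(u, v) for v in vectors] for u in vectors]


# Made-up profiles: marker identifier -> relative abundance.
profiles = [
    {"A": 0.5, "B": 0.5, "C": 0.0},
    {"A": 0.5, "B": 0.5},
    {"C": 1.0},
]
matrix = correlation_matrix(profiles)
```

The Pearson coefficient ranges from -1 to 1, so a map indexed to correlation values between 0 and 1 would require rescaling or clipping these values; how that is done is not specified in the text and is left open here.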

FIG. 12 shows oligonucleotides useful for amplifying nucleic acid molecules for SARD.

FIGS. 13A and 13B show the use of the SARD strategy for Eubacteria. The double-underlined sequence and the wavy-underlined sequence represent the sequence tags for the two pools and the single-underlined sequence delineates the BpmI recognition site.

FIG. 14 is a graphical representation of a SARD analysis of a defined population.

FIG. 15 shows the sequence of SARD tags identified from the Wy-1 sample. The number in parentheses indicates the number of tags having that sequence.

FIG. 16 shows SARD tags identified from the Wy-2 sample. The number in parentheses indicates the number of tags having that sequence.

FIGS. 17A and 17B are graphical representations of the number and abundance of SARD tags. FIG. 17A shows the Wy-1 SARD Tag Diversity Profile and FIG. 17B shows the Wy-2 SARD Tag Diversity Profile.

DETAILED DESCRIPTION OF THE INVENTION

The extent of the diversity of microbes in our environment has only recently been recognized. With the advent of the polymerase chain reaction (PCR) and small subunit ribosomal DNA (rDNA) sequence analysis, researchers have been able to detect and perform phylogenetic analyses on individual microbes without first cultivating the microbes of interest. This molecular phylogenetic approach has significantly changed our view of microbial evolution and diversity (Woese, C. R., 1987, Bacterial evolution. Microbiol. Rev. 51(2):221-71; Pace, N. R., 1997, A molecular view of microbial diversity and the biosphere. Science. 276(5313):734-40). For instance, the earliest life forms are now thought to have utilized inorganic compounds for nutrition rather than compounds based upon organic carbon. In addition, the vast proportion of biological diversity is now known to be due to microbial species. Estimates have been made that there may be more than ten thousand distinct species of microbes in a single gram of soil. FIGS. 3 and 4 show some of the representative members of the domains Bacteria and Archaea, respectively, that may be found in environmental samples.

Microbes inhabit virtually all niches including extreme environments with temperatures between 20° and 250° F. Microbes have even been isolated from deep petroleum reservoirs more than a mile beneath the earth's surface (Jeanthon, C. et al., 1995, Thermotoga subterranea sp. nov., a new thermophilic bacterium isolated from a continental oil reservoir. Arch. Microbiol. 164:91-97). In order to prevail under such diverse conditions, microbes have made remarkable adaptations and have attained the ability to utilize unusual carbon and mineral resources that are immediately available. These physiological and metabolic adaptations that enable some microbes to inhabit a particular niche may also restrict their distribution to such areas. Numerous examples of environmental parameters that lead to restrictions of microbial distribution are well known and are usually dictated by a species' specific metabolic program (e.g. obligate nature of the carbon, nitrogen and energy source).

Microbes that have highly defined nutrient requirements are likely to have a restricted distribution in the environment. Thus, the microbes' dependence on the presence of a particular resource to proliferate can serve as the basis for an assay to identify the presence, and characterize the distribution, of various features in the environment, such as biological, chemical and geochemical features. In other words, microbes can function as environmental biosensors.

In one aspect of this invention, the ability of microbes to function as environmental biosensors is used to identify particular environmental states. In a preferred embodiment, a profile of a microbial population is used to identify one or more parameters of a particular environmental state. In a more preferred embodiment, a microbial population profile is used to identify areas that are likely to have mineral deposits and/or petroleum reserves. In another preferred embodiment, a microbial population profile is used in forensic science to identify decomposition of a body or to associate an individual to another individual, to an object or to a location. In yet another preferred embodiment, a microbial population profile is used to identify microbial contamination of human and animal foodstocks. In yet another preferred embodiment, the profile is used to diagnose human or animal disease.

DEFINITIONS

Unless otherwise defined, all technical and scientific terms used herein have the meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The practice of the present invention employs, unless otherwise indicated, conventional techniques of chemistry, molecular biology, microbiology, recombinant DNA, genetics and immunology. See, e.g., Maniatis et al., 1982; Sambrook et al., 1989; Ausubel et al., 1992; Glover, 1985; Anand, 1992; Guthrie and Fink, 1991 (which are incorporated herein by reference).

A microbe is defined as any microorganism that is of the domains Bacteria, Eukarya or Archaea. Microbes include, without limitation, bacteria, fungi, nematodes, protozoans, archaebacteria, algae, dinoflagellates, molds, bacteriophages, mycoplasma, viruses and viroids.

A marker is a DNA sequence that can be used to distinguish or identify a particular gene, genome or organism from another. In one embodiment, a marker may be generated by one of the methods described herein. A marker represents one or a limited number of taxonomic species or genes. In a preferred embodiment, a marker represents a single taxonomic species or gene. In one embodiment, the marker represents a single microbial species. In another embodiment, the marker represents a single viral species or type. In another embodiment, the marker represents a single immunoglobulin or TCR variable domain.

A marker diversity profile (MDP) is a data set that is obtained from each population sample and that contains a collection of markers. In a preferred embodiment, the MDP also comprises other information, including all known parameters associated with a particular population sample. Such parameters that relate to environmental samples may include inorganic components (obtained through atomic absorption analysis), organic components (obtained through GC-MS or LC-MS), grain size analysis, pH, and salinity. Parameters that relate to medical samples would include, but are not limited to, a complete medical history of the donor. In a preferred embodiment, the markers are obtained by SARD.

Methods for the Genetic Analysis of Populations

Ribosomes, which are comprised of numerous ribosomal proteins and three ribosomal RNA (rRNA) molecules, are a key component of protein synthesis. The 16S subunit rRNA, which is encoded by the 16S rDNA gene, has been the focus of much attention in microbial phylogenetic studies. The 16S rDNA sequence is highly conserved between taxonomic groups, yet also possesses regions that are highly polymorphic (FIG. 1). Moreover, the rate of change in the RNA sequence is thought to have been relatively constant over evolutionary time, enabling scientists to determine the relative relatedness of different organisms.

Typical molecular microbial analyses involve utilizing the highly conserved regions of the 16S rDNA to amplify the roughly 1,500 bp gene. The sequence of the PCR-amplified product is determined and compared with other known rDNA sequences. Although this approach is highly informative, it is not amenable to a rapid survey of an environmental microbial community.

The instant invention provides methods for quickly and easily calculating the genetic diversity of a population. The methods use hybridization-independent genomic technologies to overcome the previously-identified problems of determining genetic diversity. This method may be used for any population of cells, viruses or organisms which comprise at least one DNA molecule that comprises regions of high sequence conservation interspersed with polymorphic sequences, wherein the polymorphic sequences can be used to distinguish different members of the population of interest. One aspect of the present invention describes a method (SARD) that can capture a designated polymorphic region from a given DNA molecule present in the members of a microbial community. In a preferred embodiment, the DNA molecule is a 16S rDNA molecule. In another embodiment, the DNA molecule is the intergenic region between the 16S and 23S rDNA genes. In another embodiment, the method is used to identify the genetic diversity of a population of viral samples or of cells or organisms infected with a population of viruses. In another embodiment, the method is used to identify the diversity of immunoglobulin and/or TCR genes in a population of B and/or T cells.

The method may be performed as follows (see FIG. 2):

Step I. Sample Preparation and DNA Amplification by PCR

Samples may be obtained from any organism or region desired. For environmental microbial analyses, samples may be obtained from, without limitation, buildings, roadways, soil, rock, plants, animals, cell or tissue culture, organic debris, air or water. For medical microbial analyses, samples may be obtained from, without limitation, humans, animals, parasites, water, soil, air and foodstuffs. For viral analyses, samples may be obtained from, without limitation, viral culture stocks, humans, animals, plants, cell or tissue culture and microbes. For immunoglobulin or TCR analyses, samples may be obtained from, without limitation, humans, animals or cell or tissue cultures. DNA molecules from the sample of interest may be isolated by any method known in the art. See, e.g., Sambrook et al., 1989 and Ausubel et al., 1992. In a preferred embodiment, DNA is obtained as described by Yeates et al., “Methods for Microbiological DNA Extraction from Soil for PCR Amplification,” Biological Procedures Online, Volume 1, May 14, 1998, available through the Internet; Liu et al., Applied and Environmental Microbiology (1997) 63: 4516-4522; and Tsai et al., Applied and Environmental Microbiology (1992) 58: 2292-2295. The DNA molecules do not have to be completely purified but only need be isolated to the point at which PCR may be performed.

Environmental microbes often exist in biofilms (Costerton, J. W., et al., 1999, Bacterial biofilms: a common cause of persistent infections. Science 284(5418):1318-1322) or in tight association with solid surfaces. Microbial DNA from a sample of interest is isolated by one of several methods that are widely known to those skilled in the art and are described in the literature (Gillan, D. C. et al., 1998, Genetic diversity of the biofilm covering Montacuta ferruginosa (Mollusca, bivalvia) as evaluated by denaturing gradient gel electrophoresis analysis and cloning of PCR-amplified gene fragments coding for 16S rRNA. Appl. Environ. Microbiol. 64(9):3464-72).

The samples may be selectively enriched before they are isolated by any method known in the art. In one embodiment, for a population of microbes that are known or suspected to feed on a hydrocarbon source such as propane, the hydrocarbon may be added to the environment in which the microbes live for a period of time before the microbes are harvested. In another embodiment, a viral population may be cultivated in cells before they are isolated. In a further embodiment, B and T cells may be expanded in culture before isolation. It may be easier to obtain sufficient sample amounts if a population is expanded before isolation. However, this must be weighed against the possibility that expansion will alter the ratio of different members of the population to each other.

In general, the primers used for amplification are designed to hybridize to a region of the DNA that is highly conserved between members of the population. Further, the primers should flank a polymorphic region, the partial sequence of which should provide diagnostic information regarding the genetic diversity of the population. For instance, for the 16S rDNA gene, primers are designed to hybridize to a highly conserved region of the 16S rDNA gene flanking a polymorphic region (see FIG. 2). In an immunoglobulin gene, the primers are designed to hybridize to a region of the B cell DNA flanking the V-J recombination site. Alternatively, primers may be designed that bind to the relatively constant regions within certain regions of the V-J gene that flank a polymorphic region. See Roitt et al., Immunology, 3rd Ed., 1993, pp. 5.2-5.14, herein incorporated by reference, which shows regions of variability and conservation within immunoglobulin and TCR genes. One having ordinary skill in the art following the teachings of the specification will recognize that other genes that have regions that are highly conserved between members of the population and that flank polymorphic regions may be used in the design of primers.

The primers should also be designed to flank a region of DNA that comprises a restriction site for a restriction enzyme. In a preferred embodiment, the restriction enzyme cuts at a four-basepair recognition site, such as AluI (see FIG. 2). Furthermore, the restriction site should be near but not in the polymorphic region of the gene of interest. In a preferred embodiment, the restriction site should be one that is present in the gene of interest in a majority of the known species or genes.

A single set of primers may be used or multiple sets of primers may be used. A single set of primers may be used if it is known that the region of DNA to which the primers will bind is very highly conserved. Alternatively, if it is known that there is some variation in the conserved region, multiple sets of primers may be used to bind to the conserved DNA region. Using multiple sets of primers may be useful to identify more members of a population, especially those members of the population that exhibit less sequence identity in the conserved areas of a nucleic acid sequence. In one embodiment, four to ten sets of primers may be used to identify members of a population. Alternatively, the primers used may be degenerate, such that different molecules within a primer population will include a different base at one or more specific sites in the primer. For instance, a primer may have a site that has either cytosine or thymidine. The purpose of making primers degenerate is to increase the number of different DNA molecules that will hybridize to a particular primer. Methods of making degenerate primers are well-known in the art.
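As an illustration only, the following minimal Python sketch enumerates the individual primer molecules represented by a degenerate primer written in the standard IUPAC ambiguity code, in which Y denotes a site that is either cytosine or thymidine. The primer sequence "ACGYTA" and the function name `expand_degenerate` are hypothetical and do not come from this specification.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes. "Y" (C or T) is the
# cytosine/thymidine case mentioned in the text above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """Return every non-degenerate sequence encoded by a degenerate primer."""
    # One string of allowed bases per primer position.
    choices = [IUPAC[base] for base in primer.upper()]
    # The Cartesian product of the per-position choices gives all molecules.
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    # Hypothetical six-base primer with a single C/T degenerate site.
    for sequence in expand_degenerate("ACGYTA"):
        print(sequence)
```

The number of distinct molecules grows as the product of the choices at each degenerate site, which is one reason a primer population is usually kept to a small number of degenerate positions.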

The primers used in this and in subsequent steps are generally of a length sufficient to promote specific hybridization to a DNA sequence. The primers generally have a length of at least 12 bases, more preferably at least 15 bases, even more preferably at least 18 bases. The primers may have a length of up to 60 bases, although usually most are under 40 bases in length. Primers may include both bases that are naturally found in DNA, such as adenine, guanine, cytosine and thymidine, and may also include nucleotides that are not usually found in DNA, such as inosine.

One of the primers (the “upstream” primer) should be modified to incorporate a moiety that can be used to bind the PCR product to a solid support. The upstream primer is defined as the primer that is located on the opposite side of the polymorphic region of interest relative to the flanking four-base restriction site. A number of different binding moieties are known in the art. In a preferred embodiment, the moiety is biotin. In another preferred embodiment, the moiety is digoxigenin or six histidines.

PCR is performed using the primers to amplify a subregion that contains a polymorphic site of interest. Methods for performing PCR are well-known in the art. In one embodiment, the PCR products are normalized or subtracted by methods known in the art to lower the representation of the dominant sequences. Exemplary methods are described in Sambrook et al., 1989; Ausubel et al., 1992; Glover, 1985; Anand, 1992 (which are incorporated herein by reference).

Step II. Digestion of the Amplified Fragment and Binding to Solid Support

The amplified fragment is cut with the restriction enzyme as discussed in Step I. Any restriction enzyme may be used in this step so long as it cuts at a site immediately adjacent to the polymorphic sequence. In a preferred embodiment, the enzyme is a four-base restriction enzyme. Examples of four-base restriction enzymes are well-known in the art and include many that are commercially available. See, e.g., New England Biolabs Catalog 2000, herein incorporated by reference. Examples of four-base restriction enzymes include, without limitation, AluI, Bsh1236I, DpnI, HpaII, MboI, MspI, PalI (an isoschizomer of HaeIII), RsaI, Sau3AI and TaqI. After restriction, the DNA fragment is bound to a solid support. Numerous solid supports to immobilize DNA are known in the art. Examples include, without limitation, streptavidin beads, which would bind to a PCR product labeled with biotin; anti-digoxigenin beads, which would bind to a PCR product labeled with digoxigenin; and beads conjugated to nickel, which would bind to a six-histidine labeled product. In a preferred embodiment, streptavidin beads are used (FIG. 2).

Since the SARD tag position is dictated by the first restriction enzyme recognition site distal to the biotinylated primer used in the initial PCR reaction, there may be cases in which the first restriction enzyme recognition site is located within a conserved region of the gene of interest. In general, this will not be a problem because even though the tags from the conserved region may not be informative, most tags derived by SARD will be from a polymorphic region and will be informative. However, if one desires to decrease the number of tags that contain information from a conserved region of a gene rather than from a polymorphic region, one may purify the desired PCR products after restriction. In a preferred embodiment, one may do this by gel purifying those PCR products that have the expected size.

Step III. Digestion of the Amplification Product and Ligation to Linkers

The immobilized products are split into two pools and linkers are attached to the immobilized products of each pool. Each linker is a double-stranded synthetic DNA molecule comprising a specific DNA sequence. Both linkers incorporate a Type IIS restriction enzyme site. In a preferred embodiment, the two linkers incorporate the same Type IIS restriction enzyme site. Each of the two linkers also comprises a DNA sequence that specifically hybridizes to a primer. In one embodiment, the linkers are identical to one another and hybridize to the same primer. In a preferred embodiment, the linkers are different from each other such that each hybridizes to a different primer.

The double-stranded linker is ligated to the immobilized PCR product. The linker may incorporate the Type IIS restriction enzyme site or it may incorporate only a portion of the site. In the latter case, the linker will be designed such that ligation of the linker to the restricted DNA will reconstitute the Type IIS restriction site. In a preferred embodiment, the linker incorporates a BpmI site. Linker ligation is well-known and may be accomplished by any method known in the art. After ligation, the immobilized PCR product is isolated from the free linkers by any method known in the art. See, e.g., Velculescu et al., Science 270: 484-487, 1995; Powell, Nucleic Acids Research 14: 3445-3446, 1998; Sambrook et al., pp. F.8-F.10, 1989.

Type IIS restriction enzymes cleave at a defined distance up to 20 basepairs away from their asymmetric recognition sites. Type IIS restriction enzymes that are commercially available include enzymes that leave 5′ overhangs and those that leave 3′ overhangs as double-stranded DNA products. Some enzymes of the former class include: BsmFI (10/14), Bst71I (8/12), and FokI (9/13), where the numbers in parentheses indicate the cleavage position on the same DNA strand as the recognition sequence/cleavage position on the complementary DNA strand. Enzymes of the latter class include: BpmI (16/14), BsgI (16/14), Eco57I (16/14) and GsuI (16/14). The 3′ overhang left by these enzymes must be removed for a blunt ligation (Step IV). Therefore, enzymes that cleave at positions 16/14 result in a 14 base-pair tag. Other enzymes that cut at a more distal position could create a larger tag. For instance, MmeI (20/18) leaves a 3′ overhang, but is not commercially available (Tucholski, J. et al., 1995, MmeI, a class-IIS restriction endonuclease: purification and characterization. Gene 157: 87-92).
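The arithmetic behind the cleavage notation may be sketched as follows. This Python fragment is illustrative only: the function name `blunt_end_distance` is hypothetical, and the sketch assumes that blunting (removal of a 3′ overhang, or fill-in of a 5′ overhang) always adjusts the 3′ end of the recognition-site strand to match the cut on the complementary strand. Under that assumption, the number of bases retained beyond the recognition site equals the second number in the notation, which agrees with the 14 base-pair tag stated above for 16/14 enzymes.

```python
# Cut positions (same strand / complementary strand) as listed in the text.
ENZYMES = {
    "BsmFI": (10, 14),
    "Bst71I": (8, 12),
    "FokI": (9, 13),
    "BpmI": (16, 14),
    "BsgI": (16, 14),
    "Eco57I": (16, 14),
    "GsuI": (16, 14),
    "MmeI": (20, 18),
}


def blunt_end_distance(top_cut, bottom_cut):
    """Classify the overhang and return the bases kept after blunting.

    Assumption (not stated in the source): blunting trims or extends the
    3' end of the recognition-site strand until it matches the cut on
    the complementary strand, so the retained length is bottom_cut.
    """
    if top_cut > bottom_cut:
        kind = "3' overhang (trimmed back)"
    elif top_cut < bottom_cut:
        kind = "5' overhang (filled in)"
    else:
        kind = "blunt"
    return kind, bottom_cut


if __name__ == "__main__":
    for name, (top, bottom) in ENZYMES.items():
        kind, kept = blunt_end_distance(top, bottom)
        print(f"{name} ({top}/{bottom}): {kind}, {kept} bases retained")
```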

Step IV. Digestion of the Product with Type IIS Restriction Enzyme

The product is digested with the appropriate Type IIS restriction enzyme to release a DNA fragment from the anchoring bead and produce a short hybrid DNA fragment containing a portion of the polymorphic region of the DNA of interest (the tag) and the linker DNA. After digestion, the DNA must be either filled in or digested to create blunt ends. If the Type IIS restriction enzyme produces a 3′ overhang, the fragment is digested with T4 DNA polymerase to remove the 3′ overhang. If the Type IIS restriction enzyme produces a 5′ overhang, the overhang must be filled in using the appropriate deoxynucleotides and the Klenow fragment of DNA polymerase I. The DNA fragment is separated from the rest of the immobilized PCR product. In a preferred embodiment, the two pools of immobilized PCR products are digested with BpmI to release the polymorphic markers and digested with T4 DNA polymerase to create blunt ends (FIG. 2).

Step V. Ligation of Tags and PCR Amplification of Resulting Ditags

The tags are blunt-end ligated to one another using methods well-known in the art to form ditags. See, e.g., Sambrook et al., 1989; Velculescu et al., Science 270: 484-487, 1995. The ditags are subsequently amplified by PCR using primers that are unique to the linkers used in Step III. In a preferred embodiment, the primers are different from one another if the linkers used in Step III were different from one another. Alternatively, the primers may be the same if the linkers used in Step III were identical to one another. The number of PCR amplification reactions will vary depending upon the amount of DNA present in the starting material. If there is a large amount of DNA, then only one PCR amplification reaction, comprising approximately 15-30 cycles, will be required at this step. If the starting amount of DNA is low, then more than one PCR amplification reaction may be required at this step.

Step VI. Cleavage of the Ditags and Ligation to Form Ditag Concatemers

The ditags are cleaved with the four-base restriction enzyme used in Step II. The products are then ligated to create ditag concatemers. In one embodiment, the ditag concatemers range from 2 to 200 ditags. In a more preferred embodiment, the concatemer comprises 20-50 polymorphic tags. The concatemer may be sequenced directly, or may be cloned into a sequencing vector. Using a 96-channel capillary DNA sequencer, about 12,000 tags could be easily analyzed in one day. Alternatively, the concatemers may be sequenced manually.
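Once a concatemer has been sequenced, the individual tags must be recovered from the read. The following Python sketch shows one way this might be done and is not taken from this specification. It assumes that ditags in the concatemer are separated by the recognition sequence of the four-base enzyme (AGCT for AluI), that each tag has a fixed length, and that the second tag of each ditag is read on the opposite strand and must be reverse-complemented. The function names and the toy sequence are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tags_from_concatemer(concatemer, site="AGCT", tag_len=14):
    """Split a concatemer read into tags.

    Assumptions (illustrative only): ditags are punctuated by `site`,
    the first tag is the first `tag_len` bases of a ditag, and the
    second tag is the reverse complement of the last `tag_len` bases.
    """
    tags = []
    for ditag in concatemer.upper().split(site):
        if len(ditag) < 2 * tag_len:
            continue  # skip flanking sequence or truncated ditags
        tags.append(ditag[:tag_len])
        tags.append(reverse_complement(ditag[-tag_len:]))
    return tags


if __name__ == "__main__":
    # Toy read with 4-base tags so that the example stays short.
    read = "AGCT" + "AAAACCCC" + "AGCT" + "GGTTACGA" + "AGCT"
    print(tags_from_concatemer(read, tag_len=4))
```

Taking each tag from the outer ends of the ditag, rather than splitting the ditag in half, tolerates small variations in ditag length.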

Methods to Analyze Marker Data

The invention is directed toward methods of analyzing the genetic diversity of a population in a sample. Each population that is analyzed will have its own unique set of different organisms or genes. The data set that is captured from each sample should recapitulate the genetic structure in a survey format to include a marker for each gene or organism and the relative abundance of each gene or organism in the population as a whole. The markers for a particular population form a marker diversity profile (MDP), which may be entered into a database. See, e.g., FIG. 8 which shows one schematic for generating such a database. The method by which the data are captured is not critical as long as it produces an accurate representation of each population.
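By way of illustration, an MDP of the kind described above could be represented as a mapping from each marker to its relative abundance, stored together with the known sample parameters. The Python sketch below is a minimal assumption-laden example: the function name `build_mdp`, the dictionary layout, and the sample values are hypothetical and are not prescribed by this specification.

```python
from collections import Counter


def build_mdp(tags, parameters=None):
    """Build a simple marker diversity profile from a list of tag sequences.

    Each distinct tag is treated as one marker, and its relative
    abundance is its share of all tags observed in the sample.
    """
    counts = Counter(tags)
    total = sum(counts.values())
    abundance = {tag: n / total for tag, n in counts.items()}
    return {"markers": abundance, "parameters": dict(parameters or {})}


if __name__ == "__main__":
    # Hypothetical tags and sample parameters, for illustration only.
    sample_tags = ["AAAA", "AAAA", "CCCC", "GGGG"]
    mdp = build_mdp(sample_tags, {"pH": 6.8, "salinity": "low"})
    print(mdp)
```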

In one aspect of the method, MDPs are entered into a database. In a preferred embodiment, the database is kept in a computer-readable form, such as on a diskette, on a non-removable disk, on a network server, or on the World Wide Web. However, the method by which the data are captured is not critical as long as it produces an accurate representation of each microbial community.

Artificial intelligence (AI) systems can perform many data management and analysis functions. Examples of AI systems include expert systems and neural networks. Expert systems analyze data according to a knowledge base in conjunction with a resident database. Neural networks are composed of interconnected processing units that can interpret multiple input signals and generate a single output signal. AI systems are ideally suited for analyzing complex biological systems including populations through the use of deduction protocols.

A marker may be correlated with a particular condition or with another marker. See, e.g., FIG. 9 for a schematic of the steps involved in determining particular parameters associated with an MDP and FIG. 10, which shows a schematic for generating a marker diversity matrix database. A condition or state may be an environmental condition such as pH, temperature, salinity, or the presence or absence of an organic or inorganic compound such as hydrocarbons, nitrates or mineral deposits. A condition may be a physiological or medical condition such as an acute or chronic disease state, physiological state or developmental stage, or may be associated with a particular body tissue or fluid. Information regarding all known parameters associated with the samples will also be saved together with the MDPs.

Each MDP is composed of markers which represent a small number, more preferably one, species or gene. For instance, in the case of Example 1, each marker would be comprised of a 12 base-pair polymorphic 16S rDNA sequence. Such parameters that relate to environmental samples may include inorganic components (obtained through atomic absorption analysis), organic components (obtained through GC-MS or LC-MS), grain size analysis, pH, and salinity. Parameters that relate to medical samples would include, but are not limited to, a complete medical history of the donor. See, e.g., FIG. 11, which shows a schematic for mapping applications using marker diversity profiles.

In another aspect of the invention, MDPs are collected for a time course, and each time point is one of the parameters included. Time courses may be useful for tracking changes over time for a wide variety of indications. For example, time courses may be useful for tracking the progression of a disease, during environmental remediation efforts, and during oilfield production.

In another aspect of the invention, MDPs are collected in various distinct locations, such as in various geographical locations or in various tissues of the body. Comparisons of MDPs compiled from various distinct locations are useful for distinguishing changes between these various locations, which may be indicative of particular environmental conditions or disease states.

Comparison of marker diversity profiles can reveal trends in populations either relative to time or to geographical location. In the latter case, comparisons of microbial populations can resolve spatial information about the environment that would otherwise be undetected. Examples of such information include migration patterns of water, organic compounds and minerals. For instance, placer deposits of minerals are caused by the action of both water and wind causing the minerals to migrate from a lode deposit at one location only to be deposited at another location. The migration of such minerals may leave a detectable trace upon the microbial populations in the path of migration. Physical attributes of the environment could also be detected such as structures, formations and fault lines. It is commonly understood that faults offer a significant vertical migration route for gases such as methane, which is known to be differentially utilized by microbes. By combining MDP data with geographical coordinates such as elevation, longitude and latitude that can easily be obtained with global positioning system devices, it is possible to create maps delineating the distribution of various microbes in the environment.

Correlation analyses between one marker and all the other markers in the database will reveal pairs of markers that have a propensity to coincide. This process can be repeated in an iterative manner for all markers to produce a matrix of correlation coefficients between all observed markers. FIG. 5 shows a scatter plot for two pairs of markers with one of the pairs exhibiting a high degree of correlation. This approach can also be used to create a dendrogram that reflects the relative level of correlation between each marker. Therefore, at any chosen level of correlation, all of the observed markers can be divided into groups where the markers in each group share the same level of correlation with each other member of the group. If a high correlation coefficient value is chosen (e.g. 0.8), the markers of each group would, more often than not, be found in the same sample. Thus, this exercise will divide a given population into groups of genes or organisms that have a propensity to co-localize with each other. In one preferred embodiment, the exercise will divide a microbial community into groups of microbes that have a propensity to co-localize.
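The marker-by-marker analysis described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the specification: it uses the Pearson correlation coefficient, treats a marker with constant abundance as uncorrelated, and forms groups with a simple greedy rule in which a marker joins a group only if its coefficient with every existing member meets the chosen threshold. The marker names and abundance values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant marker is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def correlation_matrix(abundance):
    """Matrix of r values between all markers (abundance: marker -> per-sample list)."""
    names = sorted(abundance)
    return {a: {b: pearson_r(abundance[a], abundance[b]) for b in names}
            for a in names}


def group_markers(matrix, threshold=0.8):
    """Greedy grouping: a marker joins the first group whose members all
    correlate with it at or above the threshold; otherwise it starts a new group."""
    groups = []
    for marker in matrix:
        for group in groups:
            if all(matrix[marker][member] >= threshold for member in group):
                group.append(marker)
                break
        else:
            groups.append([marker])
    return groups


if __name__ == "__main__":
    # Hypothetical abundances of three markers across five samples.
    data = {
        "m1": [1, 2, 3, 4, 5],
        "m2": [2, 4, 6, 8, 10],
        "m3": [5, 3, 4, 1, 2],
    }
    print(group_markers(correlation_matrix(data), threshold=0.8))
```

A full dendrogram would require a hierarchical clustering routine; the greedy rule is used here only to keep the sketch short.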

Correlation analysis between a marker (variable 1) and a sample parameter (variable 2) will identify markers whose presence often, or invariably, coincides with a component present in the samples. Some types of relationships between markers and sample components (or parameters) are shown in FIG. 6. A strong correlation value between a marker and sample parameter would allow predictions to be made about the abundance of either variable (marker or sample parameter) as long as one of the variables is known.
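As a hedged illustration of how such a prediction might be made, the Python sketch below fits a least-squares line relating marker abundance to a sample parameter and then uses the line to estimate the parameter for a new sample. The assumption of a linear relationship is made here only for simplicity; the specification does not limit the relationship to a linear one. The function name `fit_line` and all data values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values are constant; no line can be fitted")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical training data: relative abundance of one marker and a
    # measured sample parameter (arbitrary units) in four samples.
    marker_abundance = [0.0, 0.1, 0.2, 0.3]
    parameter_value = [1.0, 2.0, 3.0, 4.0]
    slope, intercept = fit_line(marker_abundance, parameter_value)
    # Estimate the parameter for a new sample from its marker abundance.
    print(slope * 0.25 + intercept)
```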

In some cases, a marker will not be specific for a single species or gene. For example, the tag sequence that would be identified by the approach depicted in Example 1 would be identical for Denitrobacter permanens and Legionella anisa. In the cases where a significant correlation is found between a marker and sample parameter of interest, the preferred action is to use the tag sequence information to identify the complete gene sequence. The sequence can then be used to identify the species and to identify species-specific probes to verify the correlation. One may do this using methods known in the art, such as by PCR or by hybridization to DNA molecules isolated from the sample of interest, followed by sequencing or other method of analysis.

Species-specific probes that are identified from markers with a robust correlation to a sample parameter of interest can then be utilized as a diagnostic, or to prospect for the parameter of interest. Such assays would preferably be PCR-based and would be highly sensitive, rapid and inexpensive. In a preferred embodiment, a marker identified by these methods may be used as a hybridization probe to identify a larger piece of the DNA from which the marker is derived. The sequence of the larger DNA molecule can then be used to design primers that will specifically hybridize to the DNA molecule of interest and which can be used to specifically amplify the DNA molecule by PCR. Alternatively, one may use a hybridization-based assay using a probe that binds specifically to the DNA molecule of interest. Using specific primers or probes is especially useful for quickly determining whether a large number of samples contains the DNA molecule that correlates to the parameter of interest.

In a preferred embodiment, a marker that correlates with a desired parameter is identified. The marker may be identified using SARD, or may be identified using another method, such as restriction fragment length polymorphisms (RFLP) or terminal restriction fragment length polymorphisms (T-RFLP; Liu et al., Applied and Environmental Microbiology 63: 4516-4522, 1997). A method such as denaturing gradient gel electrophoresis (DGGE) may be used to identify size differences. In a preferred embodiment, SARD is used to identify the marker. Other samples are screened to determine whether they have the marker of interest. In a preferred embodiment, the screen used is PCR or hybridization, more preferably PCR. In an even more preferred embodiment, the marker correlates with the presence of mineral deposits or petroleum reserves.

Correlation analysis between MDPs (MDPn, variable 1; MDPn+1, variable 2) can reveal the relative similarities between samples. Samples taken from the same individual or from proximal environmental sites that have similar composition are expected to show a robust correlation coefficient (FIG. 7A). However, samples that share only one or a few parameters in common are expected not to show a significant correlation when all of the markers are considered (FIG. 7B). By incorporating the knowledge learned from the Marker-Marker and Marker-Parameter correlations, MDPs can be compared using either individual markers or, preferably, subsets of correlated markers in the analysis (FIG. 7C). This approach can eliminate much of the noise and enable one to identify hidden relationships.

Correlation analyses may be performed by any method or calculation known in the art. Correlation analyses for r and r² may be performed as described by M. J. Schmidt in Understanding and Using Statistics, 1975 (D. C. Heath and Company), pages 131-147. The degree of correlation for r may be defined as follows:

r value     Degree of correlation
1.0         Perfect
0.8-0.99    High
0.5-0.7     Moderate
0.3-0.4     Low
0.1-0.2     Negligible

In one embodiment, the correlation between two markers or between a marker and a parameter is at least low (r is 0.3-0.4). In a preferred embodiment, the correlation is at least moderate (r is 0.5 to 0.7). In a more preferred embodiment, the correlation is high (r is 0.8 to 0.99).
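The degrees of correlation defined above can be applied programmatically, for example as in the following Python sketch. Two choices in the sketch are assumptions and are not stated in the specification: values falling between the listed ranges (for example 0.75) are assigned to the lower band, and the absolute value of r is used so that negative correlations are graded by strength. The function name `degree_of_correlation` is hypothetical.

```python
def degree_of_correlation(r):
    """Map a correlation coefficient to the descriptive bands given above.

    Assumptions: the lower bound of each listed range is used as the
    cut-off, so values between ranges fall into the lower band, and
    the sign of r is ignored.
    """
    strength = abs(r)
    if strength >= 1.0:
        return "Perfect"
    if strength >= 0.8:
        return "High"
    if strength >= 0.5:
        return "Moderate"
    if strength >= 0.3:
        return "Low"
    return "Negligible"


if __name__ == "__main__":
    for value in (1.0, 0.85, 0.6, 0.35, 0.15):
        print(value, degree_of_correlation(value))
```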

With the development of numerous genomic technologies for analyzing complex sets of nucleic acids, we have the opportunity to begin to catalog the reservoir of microbial, and hence, metabolic diversity. Since the proliferation of a microbe in a given location will depend upon the presence of the requisite metabolic nutrients, information as to the abundance of that microbe can serve as a biosensor for a given set of parameters. When viewed as a whole, the microbial community structure in a given location will hold intrinsic biosensor potential for a wide range of parameters. The predictive reliability of the data from a complete microbial community will also be significantly increased. For example, if a given microbe were present in 50% of soil samples taken above petroleum reservoirs and were found nowhere else, then the presence of ten such microbes would create a predictive value with 99.9% accuracy.
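One reading of the 99.9% figure, offered here as an interpretation rather than as the stated derivation, is that each of the ten microbes occurs independently in a fraction p = 0.5 of samples taken above a reservoir and never elsewhere. The probability that a sample taken above a reservoir contains at least one of the n = 10 microbes is then:

```latex
\[
P(\text{at least one of } n \text{ microbes present})
  = 1 - (1 - p)^{n}
  = 1 - (0.5)^{10}
  = 1 - \frac{1}{1024}
  \approx 0.999
\]
```

The independence of the ten microbes is an assumption of this calculation; correlated occurrence would reduce the combined predictive value.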

Applications of the Invention

Geochemical and Mineral Exploration

The methods described in this invention have several benefits over existing technologies. For instance, in the area of geochemical exploration, genomic rDNA-based assays potentially will be able to resolve an extensive set of geochemical parameters of interest to the petroleum and mining industries. Currently, many different technologies are required to measure these parameters. Because this invention is based upon a universal measure, nucleic acid detection, it can greatly reduce instrumentation and sample outsourcing costs.

Oil and gas reservoirs are located well beneath the earth's surface at depths from a few hundred feet to more than 10,000 feet. When oil is formed, it undergoes a migration in which one of two things takes place. The oil may continue to migrate until it ultimately reaches the surface, where it evaporates over time. Alternatively, its migration may be blocked by an impermeable structure, a so-called “trap”. Geophysical methods (such as three-dimensional seismic methods) for petroleum exploration rely on finding these trap structures with the hope that they contain oil.

Crude oil is made up of a variety of hydrocarbon chain lengths. The lightest hydrocarbons (namely methane, ethane, propane and butane) are often able to diffuse through the trap structures and, as a result of pressure gradients, undergo a vertical migration to the surface. Certain microbes present at the surface or in the surface layer are able to utilize these migrating hydrocarbons, which occasionally results in mineralogical changes that are detectable at the surface. Thus, these migrating hydrocarbons would be expected to affect microbial populations, such that the ability to determine the genetic diversity of a microbial population may reveal microbial signatures that are indicative of the presence of oil.

Recent advances in microfluidics in the genomics industry have resulted in the development of instruments that can detect specific nucleic acids within a few minutes. Utilizing such instruments will enable measurements to be made in the field for a variety of parameters. In contrast, conventional chemical assays require laboratory analysis and interpretation.

Biosensors have been created that are able to detect hydrocarbons present at, or below, the level of detection of sophisticated GC-MS analytical instrumentation (Sticher, P. et al., 1997, Development and characterization of a whole-cell bioluminescent sensor for bioavailable middle-chain alkanes in contaminated groundwater samples. Appl Environ Microbiol. 63:4053-4060). Sticher et al. demonstrated that a genetically engineered microbe comprising a single reporter gene was able to sense extremely small changes in its environment in response to an acute treatment with a particular hydrocarbon. The instant invention could document the effect upon a population comprising thousands of microbes over geologic time and thus has the potential of being more sensitive than current analytical instruments.

This invention may also be used to create a survey of biological entities that is limited only by the prerequisite that these entities contain nucleic acids that are arranged in regions that are conserved and regions that are polymorphic when compared to sequences from related organisms. Some additional examples of the application of this invention are described below.

Oil and Gas Reservoir Development

In addition to the application of this invention in petroleum exploration, this invention could also be useful in the development of oil and gas reservoirs. Several properties of oil reservoirs that directly affect the commercial viability of the reservoir are modulated at some level by microbes. Hydrogen sulfide is sometimes present in crude oil and can render otherwise ‘sweet’ oil into ‘sour’ oil. In addition to its corrosive effect on oilfield equipment, H₂S also poses a risk to the workers and significantly reduces the value of an oil reservoir since a washing plant must be installed to remove the gas. The levels of H₂S can change during the development of a reservoir, and this change is now thought to be the result of sulfate-reducing bacteria (Leu, J.-Y. et al., 1999, The same species of sulphate-reducing Desulfomicrobium occur in different oil field environments in the north sea. Lett. Appl. Microbiol. 29(4):246-252). By identifying the presence of microbes that could lead to H₂S production, the valuation of new reservoirs and the resulting developmental strategies could be made more effective.

Crude oil and natural gas are composed of a complex mixture of hydrocarbons including straight chain hydrocarbons of lengths generally between 2 and 40 carbon atoms. The shorter chain-length hydrocarbons are more valuable (e.g. gasoline, C₄-C₁₀). In some oil reservoirs, the lighter hydrocarbons are selectively removed either during or prior to development of the reservoir. Microbes have been suspected to play a role in this process since the shorter chain-length hydrocarbons are more bioavailable. This invention could identify microbes that are involved in this process and therefore make predictions as to the susceptibility of certain reservoirs to the depletion of short chain hydrocarbons. This invention may also be able to identify microbes capable of shortening long chain hydrocarbons, thereby increasing the value of existing reservoirs.

Insect and Parasite Detection

The significant negative impact insects can have on agriculture is widely known. Insects can also serve as vectors for the transmission of many disease causing microbes. Numerous microbe-insect relationships have been described. For example, the bacterial genus Wolbachia is found associated with many species of ants and has been shown to alter sex determination and fecundity in the host (Wenseleers, T. et al., 1998, Widespread occurrence of the microorganism Wolbachia in ants. Proc. R. Soc. Lond. B. Biol. Sci. 265(1404):1447-52). In addition, many intracellular endosymbiotic bacterial species have been identified in ants (Schroder, D. et al., 1996, Intracellular endosymbiotic bacteria of Camponotus species (carpenter ants): systematics, evolution and ultrastructural characterization. Mol. Microbiol. 21:479-89). Most, if not all, insects probably have species-specific intimate relationships with microbes which could represent an Achilles heel for the control of insect populations. The invention described in this application could provide a means to identify microbes that modulate the well-being of a given insect species.

The identification of microbes that are specifically associated with a given insect could also potentially serve as the basis for a highly sensitive test for the presence of the insect. For instance, current methods to identify the presence of termites in wooden structures are based on visual inspection and are largely inadequate. A test for the presence of a termite-associated microbe that is based on PCR-amplification would be both non-invasive and highly sensitive.

Further, the ability to create comprehensive inventories of microbial diversity has several applications that relate to microbial ecology that would have utility in the agricultural industry. For instance, the agriculture industry utilizes enormous amounts of pesticides prophylactically to prevent loss of crops. This invention provides the ability to perform comprehensive surveys of microbial populations and could lead to predictions as to the susceptibility of a given field to particular plant pathogens. This knowledge could lead to a better strategy of pesticide applications.

Further, surveying microbial diversity in an environment such as an agricultural field at various times of the year would reveal the dynamic changes in microbial populations that occur as a result of seasonal fluctuations (temperature and moisture), pesticide application and the proliferation of certain organisms.

Comparison of the diversity profiles taken from the same or similar site at different times would reveal interactions between species in a population. These productive interactions may manifest themselves either in the increase or decrease in the representation of one marker relative to the decrease or increase in the representation of a second marker, respectively.
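One way to look for the reciprocal pattern described above is to correlate the representation of two markers across sampling times; a strongly negative coefficient would flag a candidate interaction. The following is a minimal sketch only. The marker abundances are hypothetical, and the use of the Pearson coefficient is an assumed choice of statistic rather than one specified in this disclosure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / sqrt(vx * vy)

# Hypothetical fraction of all tags contributed by each marker
# at five successive sampling times at the same site.
marker_a = [0.02, 0.05, 0.09, 0.14, 0.20]
marker_b = [0.18, 0.15, 0.10, 0.06, 0.03]

r = pearson(marker_a, marker_b)
print(f"correlation between marker A and marker B: {r:.2f}")
```

A coefficient near -1 indicates that one marker rises as the other falls; with real survey data, many marker pairs would be tested and the result would need correction for multiple comparisons.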

Such information could provide an early warning of the proliferation of a pest species as well as the identification of species that are pathogenic to a pest species. These pathogenic organisms may have value either as a biological agent to control proliferation of pest species and/or as a source for genes or compounds that would act as pesticides. This phenomenon is not limited to microbe-microbe interactions. The eggs as well as larval stages of insects likely interact with soil microbes and create a detectable impact upon microbial populations. An example of such an organism is Bacillus thuringiensis, which itself is commonly used in organic farming as an insecticidal agent. The gene responsible for this insecticidal activity (Bt toxin) has been widely used to create transgenic plants with resistance to insect attack.

Bioremediation

A considerable amount of effort has gone into the development of methods for microbe-based removal of chemicals from the environment. Such chemicals include heavy metals, polynuclear aromatics (PNAs), halogenated aromatics, crude oil, and a variety of other organic compounds such as MTBE. Regulatory considerations for the release of micro-organisms into the environment have re-directed efforts towards identifying and augmenting the growth of endemic organisms that have the capability to metabolize or remove compounds of interest from the environment.

The present invention can facilitate bioremediation efforts in two ways. First, organisms can be identified at a given site that have either been previously shown to be capable of removing compounds of interest, or that have a significant likelihood of having the capacity to metabolize the relevant compounds based upon their coincidence in the environment with the chemical in numerous geological settings. Secondly, this invention can identify trends relating soil types to particular microbial species. In this case, the correlations that are drawn from the database between microbial distribution and soil types can be coupled with the existing knowledge base of geochemistry. For instance, the USGS provides many publicly available maps describing numerous geochemical and geophysical parameters. Extrapolations of the distribution of microbial species can be made to the regional, and possibly worldwide, level. These extrapolated microbial distributions could serve as the basis for site-specific treatment regimens to augment the growth of certain relevant species without first performing a microbial survey.

Immunological Applications

Methods could be applied to map immunoglobulin and TCR gene rearrangements, which occur normally during B and T cell differentiation. These rearrangements could provide a profile of an individual's immunoglobulin and TCR diversity that could be correlated with medical history. This type of analysis might lead to the early diagnosis of certain conditions or diseases and allow for a more proactive and early treatment. Further, samples of immune cells could be isolated from individuals or certain body fluids or tissues to identify potential immunoglobulin or TCR profiles that may be correlated with particular diseases, particularly autoimmune diseases. For instance, it has been demonstrated that T cells expressing particular TCR subtypes are found in higher levels in the synovial fluid of individuals suffering from rheumatoid arthritis. The methods of the instant invention may be used to identify other correlative immunoglobulin or TCR profiles in autoimmune and other diseases.

Virus Detection

The human genome contains thousands of copies of human endogenous retroviruses (HERVs) that make up as much as 1% of the human genome (Sverdlov, E. D. 1998, Perpetually mobile footprints of ancient infections in human genome. FEBS Lett. 428:1-6). These sequences are thought to be remnants of infections that occurred millions of years ago. These sequences can transpose to other locations in the human genome and may be responsible for disease-susceptibility in certain human populations (Dawkins, R. et al., 1999, Genomics of the major histocompatibility complex: haplotypes, duplication, retroviruses and disease. Immunol. Rev. 167:275-304). This invention could be used to survey HERV polymorphic sequences and determine whether they correlate with a variety of clinical parameters.

Forensic Science

This invention, particularly SARD analysis, can be used in forensic applications as well. Studies over the past ten to twenty years have focused upon the changes that occur in a person's or animal's body after death. Many of these changes involve changes in microbial populations that occur during decomposition of the body. Changes in microbial populations have been correlated with length of time that a person has been dead and the conditions that the body experienced after death, e.g., heat, sun exposure, partial or complete burial, rain, etc. SARD analyses would permit forensic scientists to quickly and accurately determine the size and type of microbial populations, which in turn may be used to determine more accurate times of death as well as conditions that the body may have been exposed to.

Other Applications

This approach can be utilized for any polymorphic region in a genome, whether microbial, viral or eukaryotic, that is flanked by conserved DNA sequences. This method also need not be restricted to genes. The DNA sequence of intergenic regions of genomes is not under as high a level of selective pressure and thus represents highly polymorphic DNA sequence. One example of such a region is the intergenic region between the large (23S) and small (16S) rDNA subunit coding regions. The above-described methods may be used to distinguish members of a population based upon size differences of the intergenic region between the 16S-23S or 23S-5S rDNA genes. The spacer region between these genes has been found to be hypervariable in microbial populations. In the case of the 16S-23S intergenic region, the spacer size ranges between about 200 and 1500 base-pairs depending on the presence or absence of various tRNA genes (Noor M. 1998, 16S-23S and 23S-5S intergenic spacer regions of lactobacilli: nucleotide sequence, secondary structure and comparative analysis, Res. Microbiol. 149(6):433-448; Berthier F. et al., 1998, Rapid species identification within two groups of closely related lactobacilli using PCR primers that target the 16S/23S rRNA spacer region, FEMS Microbiol Lett. 161(1):97-106; Tilsala-Timisjarvi A. et al., 1997, Development of oligonucleotide primers from the 16S-23S rRNA intergenic sequences for identifying different dairy and probiotic lactic acid bacteria by PCR. Int. J. Food Microbiol. 35(1):49-56). Both of these rDNA genes are transcribed on the same operon. Therefore, the conserved regions of the rDNA coding sequence of these subunits can be utilized to amplify the intergenic regions.

In order that this invention may be better understood, the following examples are set forth. These examples are for purposes of illustration only and are not to be construed as limiting the scope of the invention in any manner.

Example 1 Serial Analysis of rDNA Polymorphic Tags from the Domain Eubacteria

A sample comprising environmental bacteria was obtained and total DNA was extracted from the sample. To amplify the DNA, PCR was performed by mixing 5 μL 10× Advantage2 reaction buffer (Clontech), 2.5 μL dNTPs, 5 μL 8 μM TX9/TX16 primers, 50 ng of sample DNA, 0.5 μL Advantage2 Taq polymerase (Clontech) and water to 50 μL. Primer TX9 was biotinylated. The primers are shown in FIG. 12 and a general strategy of this example is shown in FIG. 13. The reaction mixture was then subjected to PCR under the following conditions:

-   a) 94° C. for 5 minutes;
-   b) 94° C. for 1 minute, 10 seconds;
-   c) 55° C. for 50 seconds;
-   d) 68° C. for 1 minute;
-   e) repeat steps (b) through (d) 20 times; and
-   f) 68° C. for 2 minutes.

The approximately 600 basepair PCR product was gel purified using 1% agarose and a Qiaquick kit (Qiagen) according to manufacturer's instructions. The DNA was eluted with 50 μL Tris-EDTA (TE; 10 mM Tris pH 8; 1 mM EDTA). To restrict the amplified DNA, 10× New England Biolabs Buffer #2 (NEB#2) was added to bring the concentration of the buffer to 1× and 25 units of AluI were added. The reaction mixture was incubated for 2 hours at 37° C.; 25 additional units of AluI were added and the reaction mixture was incubated for a further one hour followed by inactivation of the enzyme at 65° C. for 20 minutes.

In order to immobilize the restricted DNA fragment to a bead, an equal volume of 2×BW buffer (2×BW=10 mM Tris pH 7.5; 1 mM EDTA; 2 M NaCl) was added to the reaction mixture, followed by addition of 50 μL washed M-280 Streptavidin (SA) beads (Dynal). The reaction mixture was incubated for 20 minutes at room temperature. The beads were then washed twice in 1×BW and twice in wash buffer (10 mM Tris pH 8; 10 mM MgSO₄; 50 mM NaCl). During the last wash step, the beads were split into two pools.

To add the linkers to the immobilized DNA, one pool of beads was resuspended in 4 μL of 10 μM linker TX-12/13 (a double-stranded DNA molecule comprising primers TX-012 and TX-013) to form pool A, while the other pool was resuspended in 4 μL of 10 μM linker TX-14/15 (a double-stranded DNA molecule comprising primers TX-014 and TX-015) to form pool B. 36 μL of T4 ligase mix (4 μL 10× ligase buffer; 32 μL water; 0.2 μL T4 ligase [400 U]) was added to each linker/bead mixture and the mixtures were incubated overnight at 16° C.

The DNA molecule attached to the streptavidin beads was then incubated with BpmI, which recognizes its specific restriction sequence (GAGGTC) in the DNA molecule. BpmI cleaves the DNA such that it releases the streptavidin bead from the DNA molecule and incorporates a part of the polymorphic region of the DNA molecule first amplified. The beads were washed twice with 0.5 mL each 1×BW and twice with wash buffer. The beads were resuspended in 10 μL BpmI mix (1 μL 10×NEB#3 [New England Biolabs]; 1 μL 1 μg/μL bovine serum albumin; 8 μL water and 1 μL BpmI [New England Biolabs]). The reaction mixture was incubated at 37° C. for 2 hours and then at 65° C. to inactivate the enzyme. The supernatant containing the DNA tags was then isolated.

In order to remove the 3′ overhang on the DNA tags and make them blunt ended, 10 μL T4 polymerase mix (1 μL 10×NEB#2; 0.5 μL 4 mM dNTPs; 8.5 μL water and 0.33 μL T4 polymerase [1 U; New England Biolabs]) was added to the supernatant (10 μL) on ice. The reaction was incubated at 12° C. for 20 minutes and at 65° C. for 20 minutes to inactivate the polymerase. In order to form ditags, pool A and pool B were recombined to give a total volume of 40 μL. Four μL 10 mM rATP and 0.2 μL T4 DNA ligase (400 U) were added and the reaction was incubated 4 hours to overnight at 16° C.

As an intermediate amplification step, the ditags were then amplified in a 300 μL PCR reaction (30 μL 10× Advantage2 Taq buffer; 15 μL 4 mM dNTPs; 30 μL 8 μM TX111/TX121 primer mix; 3 μL Advantage2 Taq polymerase; 3 μL ditag template; 219 μL water). The 300 μL reaction mix was split into three 100 μL reactions and amplified using the following conditions:

-   a) 94° C. for 5 minutes;
-   b) 94° C. for 30 seconds;
-   c) 56° C. for 30 seconds;
-   d) 68° C. for 40 seconds;
-   e) repeat steps (b) through (d) for 15 cycles; and
-   f) 68° C. for 2 minutes.

After amplification, three volumes (900 μL) QG buffer (Qiagen) and one volume (300 μL) isopropanol were added, the mixture was bound to a Gel Extraction Spin Column (Qiagen) and the DNA was eluted with 50 μL TE.

The amplified DNA was then subjected to non-denaturing polyacrylamide gel electrophoresis (PAGE; Novex 1 mm 10% TBS-PAGE gel). The gel was stained with 5 μg/mL ethidium bromide, the approximately 106 basepair ditags were excised from the gel, the excised gel was fragmented and 0.3 mL of TE was added to the gel and incubated at 65° C. for 30 minutes. The gel/TE mixture was transferred to a miniprep spin column (Qiagen) and the eluate containing the amplified ditags was collected (the ditag template). This amplified DNA was called the P300 ditag template.

In order to determine the optimal number of PCR cycles for large-scale amplification (PCR cycle titration), 6.4 μL P300 ditag template was mixed with 8 μL 10× Advantage2 Taq buffer, 4 μL 4 mM dNTPs, 8 μL TX111/TX121 primers, 0.8 μL Advantage2 Taq polymerase and 52.8 μL water. The reaction mixture was split into three 25 μL reactions and amplified for 6, 8 or 10 cycles using the PCR conditions described above for the intermediate amplification step. After amplification, the DNA products were desalted and purified using a Qiagen Gel Extraction Spin Column as described above except that only 75 μL QG buffer and 25 μL isopropanol were used, and the DNA products were eluted with 20 μL TE. The cycle number that produced the largest 106 basepair ditag yield without any detectable vertical smearing by 10% TBS-PAGE analysis was used for large-scale amplification.

For large-scale ditag amplification, 180 μL 10× Advantage2 Taq buffer, 90 μL 4 mM dNTPs, 180 μL 8 μM TX111/TX121 mixture, 18 μL Advantage2 Taq polymerase, 144 μL P300 ditag template and 1188 μL water were mixed together and then split into 18 100 μL reactions. PCR was performed as described immediately above using the number of cycles that was determined from the previous titration step. The PCR reactions were desalted by adding three volumes QG buffer and one volume isopropanol to the reactions and passed through a total of four Qiagen Gel Extraction Spin Columns. Each column was eluted with 50 μL TE. The samples were subjected to non-denaturing preparative PAGE (Novex 1.5 mm 10% TBS-PAGE). The DNA bands were stained and excised as above and split into three tubes. The polyacrylamide was fragmented and eluted using TE as described above and the DNA was purified by passing the contents of the gel/TE mixture through a Qiagen Plasmid Spin Miniprep column as described above.
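The large-scale reaction uses the same proportions as the 80 μL cycle-titration mix, scaled to 1800 μL. The following sketch shows that scaling with exact arithmetic; the helper name scale_mix is hypothetical and is given only to illustrate the calculation.

```python
from fractions import Fraction

# Cycle-titration recipe from the text (volumes in microliters, 80 uL total).
# Volumes are given as strings so that Fraction parses them exactly.
titration_mix = {
    "10x Advantage2 Taq buffer": "8",
    "4 mM dNTPs": "4",
    "TX111/TX121 primers": "8",
    "Advantage2 Taq polymerase": "0.8",
    "P300 ditag template": "6.4",
    "water": "52.8",
}

def scale_mix(mix, target_total):
    """Scale every component so that the mix sums to target_total."""
    volumes = {name: Fraction(v) for name, v in mix.items()}
    factor = Fraction(target_total) / sum(volumes.values())
    return {name: v * factor for name, v in volumes.items()}

large_scale = scale_mix(titration_mix, 1800)
for name, volume in large_scale.items():
    print(f"{name}: {float(volume):g} uL")
```

Scaling by total volume keeps every component at the same final concentration, which is why the cycle number found in the titration can be carried over to the large-scale reactions.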

To precipitate the DNA, 1 μL glycogen, 150 μL 10 M ammonium acetate and 1125 μL ethanol were added to each eluate of approximately 300 μL. The mixture was incubated at −80° C. for 20 minutes and microfuged at 13,000 rpm at 4° C. for 15 minutes. The pellet was washed with 1 mL 70% ethanol and dried at 37° C. for 5 minutes.

The DNA pellets were resuspended in 150 μL AluI mix (15 μL NEB#2, 135 μL water and 15 μL AluI [150 U]). The reaction mixture was incubated at 37° C. for 2 hours. 150 μL of 2×BW buffer was added and the sample was transferred to a new tube containing 150 packed SA beads. The beads were incubated with the DNA mixture for 30 minutes at room temperature and the beads were pelleted magnetically. The supernatant was removed, extracted with 250 μL phenol/chloroform/isoamyl alcohol, and the aqueous phase was precipitated with 150 μL 10 M ammonium acetate and 1125 μL ethanol overnight at −80° C. The DNA was centrifuged at 13,000 rpm at 4° C. for 15 minutes, the pellet washed with 70% ethanol and dried. This step removes any free linkers that were present as well as any incompletely digested DNA products that were still bound to the biotinylated linkers.

In order to concatenate the ditags, the DNA pellet was resuspended in 10 μL ligase mix (1 μL 10× ligase buffer; 9 μL water and 1 μL T4 ligase [100 U]) and incubated at 16° C. for 30 minutes. The DNA was precipitated by adding 40 μL TE, 25 μL 10 M ammonium acetate and 180 μL ethanol. The DNA was precipitated for 15 minutes at −80° C. and centrifuged, washed and dried as described above.

The concatenated ditags were purified by resuspending the pellet in 1×TBE loading buffer and subjecting the sample to non-denaturing PAGE (8% TBE-PAGE). The area between 0.5 and 1.2 kb was excised and the DNA was eluted from the gel into 300 μL TE as described above. The DNA was precipitated using 1 μL glycogen, 150 μL 10 M ammonium acetate and 1125 μL ethanol. The DNA was precipitated, centrifuged, washed and dried as before.

The concatenated ditags were cloned in pUC19 by resuspending the DNA pellet in 3 μL SmaI-cut pUC19 (approximately 100 ng plasmid DNA) and adding 7 μL T4 ligase mix (1 μL 10× ligase buffer, 6 μL water and 0.2 μL T4 ligase [400 U]). The plasmid/ditag mixture was incubated for 2 hours at 16° C. and 1 μL of the mixture was used to transform 100 μL chemically-competent DH10B. Amp-resistant transformants were screened by PCR using pUC/m13 forward/reverse 17mers as PCR primers. Transformants containing concatemerized ditags were then sequenced.

Example 2 SARD Analysis of a Defined Population

SARD was performed essentially as described in Example 1. In this example, commercially available bacterial genomic DNA samples were mixed at an equal concentration (weight/volume). The bacterial DNA samples used included Bacillus subtilis, Clostridium perfringens, Escherichia coli, Lactococcus lactis and Streptomyces coelicolor. Equal volumes of the DNA samples (total genomic DNA) were mixed at a concentration of 50 ng/μL each. Chart I shows the size of the genomes of each bacterial species, the number of 16S rDNA copies per genome and the molar percentage of 16S copies for each bacterial species in the total DNA sample.

CHART I

Bacteria         Genome (Mb)   16S Copies/Genome   Molar %
B. subtilis      4.2           10                  17.2
C. perfringens   4.4           10                  16.4
E. coli          4.6           7                   11.0
L. lactis        2.4           10                  30.7
S. coelicolor    8.0           6                   5.4

After SARD, 120 tags were sequenced. Chart II shows the expected number and percentage of tags compared to the observed number and percentage of tags for the population. See also FIG. 14.

CHART II

Bacteria         Expected Tag Number   Observed Tag Number   Expected Tag Percentage   Observed Tag Percentage
B. subtilis      21                    44                    17.2%                     36.7%
C. perfringens   20                    18                    16.4%                     15.0%
E. coli          13                    8                     11.0%                     6.7%
L. lactis        37                    35                    30.7%                     29.2%
S. coelicolor    6                     6                     5.4%                      5.0%

Each of the rDNA genes from these species produced a SARD tag that was distinguishable from the other members of the set. As can be observed, approximately twice as many tags that corresponded to 16S rDNA from B. subtilis were found than were expected based upon the molar percentage of 16S rDNA from B. subtilis. The observation that B. subtilis appeared to be twice as abundant as was expected has been reported previously. Farrelly et al. (Farrelly, V. F. et al., 1995, Effect of genome size and rrn gene copy number on PCR amplification of 16S rRNA genes from a mixture of bacterial species. Appl. Environ. Microbiol. 61(7):2798-2801) described similar results when PCR amplifying the 16S rDNA gene from mixed populations of genomic DNA. They concluded that this phenomenon was the result of the tandem organization of the rrn operons in the B. subtilis genome where multiple rDNA genes may be amplified as a single product. The remaining tags were found at abundances close (<40% deviation) to their expected values.
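The comparison of expected and observed tags can be reproduced from the values in Charts I and II. The following sketch derives the expected tag number from each molar percentage and the 120 sequenced tags, then reports the deviation of the observed number. Defining deviation as the relative difference between the observed and expected tag numbers is an assumption made for illustration; the text does not state how the figure of 40% was calculated.

```python
TOTAL_TAGS = 120  # tags sequenced after SARD

# species: (expected molar % from Chart I, observed tag number from Chart II)
chart = {
    "B. subtilis":    (17.2, 44),
    "C. perfringens": (16.4, 18),
    "E. coli":        (11.0, 8),
    "L. lactis":      (30.7, 35),
    "S. coelicolor":  (5.4, 6),
}

def summarize(chart, total):
    """Return {species: (expected tags, observed tags, % deviation)}."""
    rows = {}
    for species, (molar_pct, observed) in chart.items():
        expected = round(total * molar_pct / 100)
        deviation = abs(observed - expected) / expected * 100
        rows[species] = (expected, observed, deviation)
    return rows

summary = summarize(chart, TOTAL_TAGS)
for species, (expected, observed, deviation) in summary.items():
    print(f"{species:15s} expected {expected:3d}  observed {observed:3d}  "
          f"deviation {deviation:5.1f}%")
```

Under this definition, B. subtilis is the only species whose observed tag number deviates from the expected number by more than 40%.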

Example 3 SARD Analysis of Environmental Bacterial Diversity

In order to demonstrate that the SARD method could be used to survey environmental bacterial diversity, total DNA was extracted from two soil samples (Wy-1 and Wy-2) taken from the Rocky Mountain Oilfield Testing Center (RMOTC, Casper, Wyo.) in October, 2000. The samples were collected about 0.5 miles apart and from a depth of 14-18 inches. The environmental DNA samples were subjected to SARD analysis as described in Example 1. In a preliminary analysis, 148 tags were identified from Wy-1 and 234 tags were identified from Wy-2 (FIGS. 15 and 16, respectively).

In the Wy-1 sample, 58 distinct tags were identified and the abundance of each tag varied. The most abundant tag (ATGGCTGTCGTCAGCT) (SEQ ID NO: 6) made up about 34% of the population. This tag sequence is identical to many bacterial sequences in GenBank and its position within the 16S rDNA gene indicates that it is located in a conserved region located distal to the targeted AluI restriction site. In other words, the contributing 16S gene(s) for this tag did not contain the conserved AluI site. Since the SARD tag position is dictated by the first AluI site distal to the biotinylated primer used in the initial PCR reaction, it is likely that the first AluI site in the contributing 16S gene(s) was located downstream within a conserved region. In order to decrease the number of tags that do not contain the conserved AluI site next to the polymorphic region, one may gel purify the approximately 100 basepair PCR products after the first AluI restriction step. However, this may result in losing some information. Nevertheless, 39% of the tags (58/148) in this set were different from each other. See FIGS. 15 and 17.

The Wy-2 sample was found to contain 79 different tags out of a total of 234 tags that were examined. Thus, 34% of the tags (79/234) in this set were different from each other. See FIGS. 16 and 17. As in the case with Wy-1, the tag ATGGCTGTCGTCAGCT (SEQ ID NO: 6), which represents a conserved sequence in a 16S rDNA gene, was most abundant and made up about 30% of the population.

Combining the tags from the two sets reveals a total of 105 different tags from a total of 382 tags. Thus, 26 of 58 different tags (45%) in Wy-1 were not present in Wy-2. Likewise, 47 of 79 different tags (59%) in Wy-2 were not present in Wy-1. The tags that were only found in one of the samples are candidates for bacteria with indicator value for various parameters associated with each sample. However, there was no attempt in this preliminary analysis to obtain all of the tags present in the two samples, so it cannot be concluded that some or most of the tags found in one sample are not present in the other sample. Thus, one cannot conclude that there are tags in these two samples that are indicators for various parameters associated with each sample. Nonetheless, a full-fledged analysis of these samples may provide such indicators.
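The sample-specific tag counts follow from the distinct-tag totals by inclusion-exclusion. The following sketch makes the arithmetic explicit using only the counts reported above; the function name overlap_summary is hypothetical.

```python
def overlap_summary(distinct_a, distinct_b, distinct_union):
    """Derive shared and sample-specific tag counts from distinct-tag totals."""
    shared = distinct_a + distinct_b - distinct_union  # inclusion-exclusion
    only_a = distinct_a - shared
    only_b = distinct_b - shared
    return shared, only_a, only_b

# Distinct tags reported for Wy-1, Wy-2 and the combined set
shared, only_wy1, only_wy2 = overlap_summary(58, 79, 105)

print(f"distinct tags shared by both samples: {shared}")
print(f"found only in Wy-1: {only_wy1} ({only_wy1 / 58:.0%} of Wy-1 distinct tags)")
print(f"found only in Wy-2: {only_wy2} ({only_wy2 / 79:.0%} of Wy-2 distinct tags)")
print(f"distinct fraction: Wy-1 {58 / 148:.0%}, Wy-2 {79 / 234:.0%}")
```

With the actual tag sequences in hand, the same quantities would be obtained directly from set intersection and difference of the two tag lists.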

Example 4 Serial Analysis of rDNA Polymorphic Tags From the Domain Archaea

The method described in Example 1 can also be applied to the domain Archaea. The domain Archaea is made up of two known kingdoms, Euryarchaeota and Crenarchaeota (Pace, N. R. 1997, A molecular view of microbial diversity and the biosphere. Science 276:734-740). One set of oligonucleotides and restriction enzymes can be used to survey both of these kingdoms.

In this example, the following oligonucleotides are designed: 5′-biotin-TA(CT)T(CT)CCCA(GA)GCGG(CT)(GCT)(GC)(GA)CTT(AGCT)-3′ (SEQ ID NO: 155) corresponding to position 817-838 of the Methanococcus jannaschii 16S rDNA gene (GenBank Accession number M59126), and 5′-GGTG(TGC)CA(GC)C(CA)GCCGCGGTAA(TC)ACC(AGCT)-3′ (SEQ ID NO: 156) corresponding to position 457-481 of the Methanococcus jannaschii 16S rDNA gene.
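In the notation used for these oligonucleotides, a group in parentheses lists the alternative bases allowed at a single position. The following sketch parses that notation and reports the length and the number of distinct sequences each degenerate oligonucleotide represents. The function names are hypothetical, and the biotin label and the 5′/3′ marks are left out of the input strings.

```python
import re

def parse_degenerate(primer):
    """Split a primer written like 'TA(CT)T' into a list of allowed-base sets."""
    tokens = re.findall(r"\(([ACGT]+)\)|([ACGT])", primer)
    return [set(group or single) for group, single in tokens]

def degeneracy(positions):
    """Number of distinct sequences represented by the degenerate primer."""
    total = 1
    for bases in positions:
        total *= len(bases)
    return total

# Primer sequences as written in the text
primer_155 = "TA(CT)T(CT)CCCA(GA)GCGG(CT)(GCT)(GC)(GA)CTT(AGCT)"
primer_156 = "GGTG(TGC)CA(GC)C(CA)GCCGCGGTAA(TC)ACC(AGCT)"

for name, primer in (("SEQ ID NO: 155", primer_155),
                     ("SEQ ID NO: 156", primer_156)):
    positions = parse_degenerate(primer)
    print(name, "length:", len(positions), "degeneracy:", degeneracy(positions))
```

The parsed lengths of 22 and 25 positions agree with the stated spans 817-838 and 457-481 of the Methanococcus jannaschii 16S rDNA gene.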

In this example, a BfaI site (CTAG) is utilized that corresponds to position 768-771 of the Methanococcus jannaschii 16S rDNA gene. This site is immediately flanking a polymorphic region.

The SARD method was tested in silico using 17 representatives from the domain Archaea (FIG. 4). 15 of the SARD tags that would be identified were all unique to this set (Table II). Two species of the genus Methanobacterium did not possess any BfaI sites in the region that would be amplified and therefore would not produce any tags.

Example 5 Surveying PCR-Amplified Polymorphic rDNA Regions

Oligonucleotide primers that are complementary to conserved regions immediately flanking a polymorphic region could be used to amplify the polymorphic DNA sequences. The oligonucleotide primer sequences could include existing or introduced restriction sites to enable subsequent cloning into a bacterial vector. By utilizing or introducing different restriction sites in each primer, the restriction digested PCR products could be concatemerized in a unidirectional fashion prior to cloning. This step would allow a serial sequence analysis of multiple polymorphic regions from a single recombinant product.

Example 6 Surveying Translation Products of 16S rDNA Polymorphic Regions

The polymorphic regions of 16S rDNA could also be surveyed by the parallel identification of the translation products of a given polymorphic region. Oligonucleotides could be designed such that they can serve to amplify a polymorphic region. The 5′ primer would also include a T7 polymerase binding site or other polymerase binding site, a Kozak consensus sequence, an initiator ATG codon and an epitope to facilitate purification. Examples of epitopes include hemagglutinin (HA), myc, Flag or polyhistidine. Following amplification, the products are subjected to in vitro transcription/translation to produce the peptide products. These peptides are purified from the cell extract and analyzed by mass spectrometry. This type of an approach has been applied to the identification of mutations in the BRCA1 gene (Garvin, A. M. et al., 2000, MALDI-TOF based mutation detection using tagged in vitro synthesized peptides. Nature Biotechnol. 18:95-97).

Although this process would not readily provide DNA sequence information from which to deduce taxonomy, it would allow for the creation of microbial diversity profiles comprised of ‘mass tags’. These mass tags could be used to identify correlations between specific tags and various sample parameters. To test the information content of this approach, a polymorphic region of the 16S rDNA genes from the species in FIG. 3 was translated in silico. Of the 34 polymorphic regions examined, 32 produced a tag with a unique mass in this set (Table III).
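An in silico translation of this kind can be sketched as follows. The code translates each region with the standard genetic code, sums monoisotopic residue masses and checks whether the resulting mass tags are distinct. The two input sequences are hypothetical and are not the sequences of FIG. 3; the leader, epitope and initiator sequences that the 5′ primer would contribute are ignored; and the function names and the rounding to two decimals are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG-ordered table
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def translate(dna):
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)

def peptide_mass(peptide):
    """Monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def mass_tags(regions, decimals=2):
    """Map each region name to its rounded peptide mass."""
    return {name: round(peptide_mass(translate(seq)), decimals)
            for name, seq in regions.items()}

# Hypothetical polymorphic regions differing at a single codon
regions = {"region_1": "ATGGCTGTCGTCAGC", "region_2": "ATGGCTGTCGACAGC"}
tags = mass_tags(regions)
unique = len(set(tags.values())) == len(tags)
print(tags, "all masses unique:", unique)
```

Because leucine and isoleucine have identical masses, two regions that differ only by such a substitution would yield the same mass tag, which is one way distinct sequences can fail to be resolved by mass alone.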

Another approach to translating the amplified polymorphic regions would be to clone and express the sequences in whole cells. For instance, oligonucleotides could be designed to amplify polymorphic regions that include sequences at the 5′ ends that would allow for cloning by homologous recombination in yeast. These sequences could be cloned into an expression cassette to create a fusion between a secreted protein, such as alpha factor or invertase, and an epitope to facilitate purification and the translated amino acid sequence of the rDNA polymorphic region. Homologous recombination in yeast is quite robust and could easily enable the isolation of 10³-10⁴ independent recombinants on a single transformation plate. The secreted products could be isolated from the medium and identified by mass spectrometry.

Example 7 Hybridization of Microbial rDNA to Immobilized Oligonucleotides

A complication with using short oligonucleotide probes in DNA microarrays is the instability of short oligonucleotide duplexes. A possible solution to this problem is to synthesize probes that include degenerate sequences to accommodate unknown sequences. The length of the oligonucleotide is dictated by the number of degeneracies, or sequence permutations, that are to be accommodated. For instance, a fully degenerate 9mer oligonucleotide requires 4⁹ or 262,144 different oligonucleotide sequences. The efficient hybridization of 9mer oligonucleotides is not possible using standard conditions. One solution has been to incorporate a ligation step in the hybridization of a target sequence to a 9mer oligonucleotide probe (Gunderson, K. L. et al., 1998, Mutation detection by ligation to complete n-mer DNA arrays. Genome Res. 8:1142-1153).

Another solution could be to effectively increase the length of the oligonucleotide by including nucleotides in the primer that are not degenerate. This approach could be applied to a survey of a microbial community by constructing oligonucleotide probes that are composed of a constant region that corresponds to a well conserved region of a 16S rDNA gene together with a degenerate sequence that corresponds to the flanking polymorphic region.

An example of such a collection of degenerate oligonucleotides for the domain Bacteria could include permutations of the following primer: 5′-AACGAGCGCAACC-3′ (SEQ ID NO: 157), where N indicates any nucleotide at that position. This sequence corresponds to position 1101-1122 of the E. coli 16S rDNA gene (GenBank Accession number E05133). Alternatively, the primers could be designed such that they are composed of a mixture of constant sequence, semi-degenerate positions (e.g. A or G) and degenerate positions (e.g. A, G, C or T).
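A probe collection of this kind can be enumerated by appending every possible degenerate tail to the constant region. The sketch below does so for a short tail and computes the size of the full collection. Placing the degenerate positions at the 3′ end and using nine of them are assumptions: the 22-nucleotide span 1101-1122 less the 13-nucleotide constant sequence shown would leave nine positions, which matches the 9mer case discussed above, but the text does not state this explicitly. The function names are hypothetical.

```python
from itertools import product

CONSTANT_REGION = "AACGAGCGCAACC"  # constant portion given in the text

def degenerate_probes(constant, n_degenerate):
    """Yield the constant region followed by every possible degenerate tail."""
    for tail in product("ACGT", repeat=n_degenerate):
        yield constant + "".join(tail)

def probe_count(n_degenerate):
    """Number of probes needed to cover n fully degenerate positions."""
    return 4 ** n_degenerate

# Small demonstration with two degenerate positions
demo = list(degenerate_probes(CONSTANT_REGION, 2))
print(len(demo), "probes, for example", demo[:3])

# Size of the collection for nine degenerate positions (assumed tail length)
print("nine degenerate positions:", probe_count(9), "probes")
```

Semi-degenerate positions could be handled the same way by replacing "ACGT" with the reduced alphabet allowed at each position.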

All publications and patent applications cited in this specification are herein incorporated by reference as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be readily apparent to those of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the appended claims.

TABLE I*

Species                                GenBank Acc#  Tag Sequence      Position   SEQ ID NO:
Desulfurobacterium thermolithotrophum  AJ001049      GTCAGTTGCCGAAGCT  814-829    158
Uncultured Aquificales OPS132          AF027104      GTCCGTGCCGTAAGCT  810-825    159
Bacteroides caccae                     X83951        [not shown]       1021-1036  160
Actinomyces bovis                      X81061        TTTCCGCGCCGTAGCT  834-849    161
Actinomyces meyeri                     X82451        TTTCTGCGCCGTAGCT  828-843    162
Denitrobacterium detoxificans          AF079507      CCTCCGCGCCGCAGCT  788-803    163
Uncultured GNS bacteria BPC110         AF154084      CCCGGTAGTCCTAGCT  765-780    164
Uncultured GNS bacteria GCA004         AF154104      CATCGGTGCCGCAGCT  824-839    165
Uncultured GNS bacteria GCA112         AF154100      CGGCGGTGCCGTAGCT  826-841    166
Acetobacter aceti                      AF127399      ACTCAGTGTCGTAGCT  782-797    167
Gluconobacter asaii                    AB024492      ACTCAGTGTCGAAGCT  783-798    168
Burkholderia sp. JB1                   X92188        CCTTAGTAACGAAGCT  837-852    169
Denitrobacter permanens                Y12639        [not shown]       789-804    170
Desulfobacter curvatus                 M34413        CTGCTGTGCCNAAGCT  861-876    171
Desulfobulbus sp. BG25                 U85473        CCTCTGTGTCGCAGCT  854-869    172
Legionella anisa                       X73394        [not shown]       790-805    173
Benzene mineralizing clone SB-1        AF029039      [not shown]       1029-1044  174
Escherichia coli                       E05133        CGTGGCTTCCGGAGCT  848-863    175
Uncultured Acidobacterium Sub. Div-1   X68464        CCGCCGTGCCGAAGCT  813-828    176
Uncultured Acidobacterium Sub. Div-1   Z73363        CGGCTGTGCCGAAGCT  521-536    177
Uncultured Acidobacterium Sub. Div-1   Z73365        CCACTGTGCCGTAGCT  521-536    178
Uncultured Acidobacterium Sub. Div-1   Z73368        CTGCTGTGCCGCAGCT  521-536    179
Uncultured Acidobacterium Sub. Div-1   Z73364        CTGCCGTGCCGGAGCT  521-536    180
Uncultured Acidobacterium Sub. Div-1   U68659        CCAATGTGCCGGAGCT  319-334    181
Uncultured Acidobacterium Sub. Div-1   D26171        CCGTCGTGCCGTAGCT  779-794    182
Uncultured Acidobacterium Sub. Div-1   X97101        CCGTCGTGTCGTAGCT  687-702    183
Uncultured Acidobacterium Sub. Div-1   X97098        CTGCCGTGTCGAAGCT  798-813    184
Uncultured Acidobacterium Sub. Div-1   AF047646      CTCCCGTGTCGAAGCT  779-794    185
Uncultured Acidobacterium Sub. Div-1   AF050548      CCGCCGTGCCGGAGCT  316-331    186
Uncultured Acidobacterium Sub. Div-2   U68612        CTGAGGAACGAAAGCT  226-241    187
Uncultured Acidobacterium Sub. Div-2   Y07646        GTGTCGTCCCGGAGCT  830-845    188
Uncultured Acidobacterium Sub. Div-3   X97097        GGGCTGTGCCGAAGCT  804-819    189
Uncultured Acidobacterium Sub. Div-3   X68466        GGTCGGTGCCGGAGCT  796-811    190
Uncultured Acidobacterium Sub. Div-3   X68468        GGTCGGTGCCAGAGCT  796-811    191
Uncultured Acidobacterium Sub. Div-3   U68648        GGTTCGTGCCGGAGCT  317-332    192
Uncultured Acidobacterium Sub. Div-3   X68467        TGTCTGTGCCGGAGCT  796-811    193
Uncultured Acidobacterium Sub. Div-3   AF013515      TATCCGTGCCGGAGCT  799-814    194
Uncultured Acidobacterium Sub. Div-3   AF027004      GGTCCGTGCCGGAGCT  778-793    195

*Sequences shown in bold with shadow are not unique to this set.

TABLE II

Species                        GenBank Acc#  Tag Sequence    Position   SEQ ID NO:
Crenarchaeota
Aeropyrum pernix               D83259        CTAGGGGGCGGGAG  614-627    196
Desulfurococcus mobilis        M36474        CTAGGTGTTGGGTG  856-869    197
Staphylothermus marinus        X99560        CTAGGTGTTGGGCG  770-783    198
Metallosphaera sedula          X90481        CTAGGTGTCGCGTA  756-769    199
Sulfolobus acidocaldarius      D14053        CTAGGTGTCGAGTA  785-798    200
Sulfolobus metallicus          D85519        CTAGGTGTCACGTG  744-757    201
Caldivirga maquilingensis      AB013926      CTAGCTGTTGGGTG  773-786    202
Pyrobaculum islandicum         L07511        CTAGCTGTCGGCCG  781-794    203
Euryarchaeota
Archaeoglobus fulgidus         X05567        CTAGGTGTCACCGA  780-793    204
Archaeoglobus veneficus        Y10011        CTAGGTGTCACCGG  758-771    205
Haloarcula japonica            D28872        CTAGGTGTGGCGTA  762-775    206
Halococcus morrhuae            D11106        CTAGGTGTGGCGTT  765-778    207
Methanococcus jannaschii       M59126        CTAGGTGTCGCGTC  768-781    208
Methanobacterium bryantii      AF028688      None
Methanobacterium subterraneum  X99045        None
Pyrococcus abyssi              Z70246        CTAGGTGTCGGGCG  767-780    209
Picrophilus oshimae            X84901        CTAGCTGTAAACTC  742-755    210

TABLE III

Species                                GenBank Acc#  Tag Sequence          M.W.     Position   SEQ ID NO:
Desulfurobacterium thermolithotrophum  AJ001049      RAQPLSLVASG*          1097.40  1079-1136  211
Uncultured Aquificales OPS132          AF027104      RAQPLSCVTSG*          1117.40  1074-1131  212
Bacteroides caccae                     X83951        RAQPLSSVTNRSC*        1417.70  1069-1126  213
Actinomyces bovis                      X81061        RAQPLSRVASTLWWGLAGD   2083.60  1088-1145  214
Actinomyces meyeri                     X82451        RAQPLPYVASTLWWGLVGD   2128.60  1082-1139  215
Denitrobacterium detoxificans          AF079507      RAQPLPHVASIRLGTHGG    1866.50  1039-1094  216
Uncultured GNS bacteria BPC110         AF154084      RAQPLLYVIRVIPD        1652.10  1074-1116  217
Uncultured GNS bacteria GCA004         AF154104      RAQPSLYVTRIIRD        1687.10  1080-1122  218
Uncultured GNS bacteria GCA112         AF154100      RAQPSPYVIRVIRD        1669.00  1082-1124  219
Acetobacter aceti                      AF127399      RAQPLSLVASMFGWAL*     1746.30  1038-1095  220
Gluconobacter asaii                    AB024492      RAQPLSLVASTFRWAL*     1815.30  1034-1092  221
Burkholderia sp. JB1                   X92188        RAQPLSLVATQEHSRET     1922.20  1094-1144  222
Denitrobacter permanens                Y12639        RAQPLPLVATFSWAL*      1669.10  1077-1131  223
Desulfobacter curvatus                 M34413        RAQPLSLVASTLCGNSNET   1960.40  1116-1172  224
Desulfobulbus sp. BG25                 U85473        RAQPLPLVASSSAGHSKGT   1863.40  1114-1170  225
Legionella anisa                       X73394        RAQPLSLVAST*          1141.40  1078-1135  226
Benzene mineralizing clone SB-1        AF029039      RAQPLPLVANRSSWGL*     1764.20  1077-1134  227
Escherichia coli                       E05133        RAQPLSFVASGPAGNSKET   1916.30  1103-1159  228
Uncultured Acidobacterium Sub. Div-1   Z73363        RAQPLSLVASGSSRAL*     1612.10  775-832    229
Uncultured Acidobacterium Sub. Div-1   Z73365        RAQPLSSVAIGSSRATLAK   1912.50  777-835    230
Uncultured Acidobacterium Sub. Div-1   Z73368        RAQPLFASCHH*          1933.50  779-835    231
Uncultured Acidobacterium Sub. Div-1   Z73364        RAQPLFAQLPSFSWALCRN   2204.80  778-835    232
Uncultured Acidobacterium Sub. Div-1   U68659        RAQPLLPXAII*          1218.70  573-630    233
Uncultured Acidobacterium Sub. Div-1   D26171        RAQPLLPVATI*          1177.50  1035-1090  234
Uncultured Acidobacterium Sub. Div-1   X97101        RAQPLSPVAII*          1163.50  943-998    235
Uncultured Acidobacterium Sub. Div-1   X97098        RAQPLSSVATI*          1141.40  1054-1109  236
Uncultured Acidobacterium Sub. Div-1   AF047646      RAQPLFLVATI*          1227.60  1035-1090  237
Uncultured Acidobacterium Sub. Div-1   AF050548      RAQPSSLVANTLW*        1441.70  572-629    238
Uncultured Acidobacterium Sub. Div-2   U68612        RAQPLHVVATRKRELYVD*   2150.60  577-630    239
Uncultured Acidobacterium Sub. Div-2   Y07646        RAQPLHVVATPQGGTLRG    1857.30  1085-1140  240
Uncultured Acidobacterium Sub. Div-3   X97097        RAQPSSLVANPQGKHPKGT   1972.40  1060-1116  241
Uncultured Acidobacterium Sub. Div-3   U68648        RARPLSCVAII*          1197.70  574-629    242
Uncultured Acidobacterium Sub. Div-3   AF013515      RAQPLSCVANPQGCTLRR    1969.50  1057-1112  243
Uncultured Acidobacterium Sub. Div-3   AF027004      RAQPSPCVATPPRAGALSGD  1950.40  1036-1096  244

*Indicates an in-frame stop codon was encountered within the polymorphic sequence.

1-44. (canceled)
45. A method for constructing a marker diversity profile (MDP) database comprising the steps of:
a) sequencing at least 148 rRNA markers from a sample, wherein said sample comprises a microbial population, and wherein each rRNA marker comprises a microbial rRNA gene polymorphic sequence;
b) determining the abundance in said sample of each rRNA marker that is sequenced in step a);
c) transducing the abundance of each of the sequenced rRNA markers into an electrical output signal;
d) storing the plurality of electrical output signals in a matrix data structure and associating in said matrix data structure each electrical output signal with the corresponding sequence of the rRNA marker from whose abundance the electrical output signal was transduced;
e) designating the plurality of electrical output signals corresponding to the plurality of marker abundances from said sample and being stored in the matrix data structure as an MDP;
f) repeating steps a-e for at least one other sample, and designating the plurality of MDPs as an MDP database.
46. A method for constructing a marker diversity profile (MDP) database comprising the steps of:
a) sequencing a plurality of rRNA markers from a sample, wherein said sample comprises a microbial population, and wherein each rRNA marker comprises a microbial rRNA gene polymorphic sequence, and assigning to each marker a unique identifier;
b) determining the abundance in said sample of each rRNA marker;
c) transducing the abundance of each rRNA marker into an electrical output signal;
d) storing the plurality of electrical output signals in a matrix data structure and associating in said matrix data structure each electrical output signal with the corresponding sequence of the rRNA marker from whose abundance the electrical output signal was transduced;
e) designating the plurality of electrical output signals corresponding to the plurality of marker abundances from said sample and being stored in the matrix data structure as an MDP;
f) repeating steps a-e for at least one other sample, and designating the plurality of MDPs as an MDP database.
47. A method for constructing a marker diversity profile (MDP) database comprising the steps of:
a) sequencing a plurality of rRNA markers from a sample, wherein said sample comprises an un-cultivated microbial population, and wherein each rRNA marker comprises a microbial rRNA gene polymorphic sequence;
b) determining the abundance in said sample of each rRNA marker;
c) transducing the abundance of each rRNA marker into an electrical output signal;
d) storing the plurality of electrical output signals in a matrix data structure and associating in said matrix data structure each electrical output signal with the corresponding sequence of the rRNA marker from whose abundance the electrical output signal was transduced;
e) designating the plurality of electrical output signals corresponding to the plurality of marker abundances from said sample and being stored in the matrix data structure as an MDP;
f) repeating steps a-e for at least one other sample, and designating the plurality of MDPs as an MDP database.
48. The method according to any one of claims 45-47, wherein at least one MDP of said MDP database further comprises a second plurality of electrical output signals produced by a method comprising the steps of:
i) providing the abundances of one or more non-microbial sample parameters that are associated with the sample from which said MDP is derived;
ii) transducing the abundances of said non-microbial sample parameters into a plurality of electrical output signals;
iii) storing said plurality of electrical output signals in a matrix data structure and associating in said structure each output signal with the non-microbial sample parameter from whose abundance the output signal was transduced;
iv) designating the plurality of electrical output signals corresponding to the plurality of non-microbial sample parameter abundances and being stored in the matrix data structure as the second plurality of electrical output signals.
49. The method according to any one of claims 45-47, wherein said rRNA marker abundance is an abundance relative to the total abundance of said plurality of the sequenced rRNA markers in the sample.
50. The method according to any one of claims 45-47, wherein said microbial rRNA gene is a 16S rRNA gene.
51. The method according to any one of claims 45-47, wherein said microbial rRNA gene polymorphic sequence is from the intergenic region between a 16S rRNA gene and a 23S rRNA gene.
52. The method according to any one of claims 45-47, wherein the sample is selected from the group consisting of: a soil sample, a rock sample, a water sample, an air sample, a hydrocarbon sample, a petroleum sample and a biofilm sample.
53. The method according to any one of claims 45-47, wherein the sample is obtained from the group consisting of: an oil reservoir, a gas reservoir, a building and a roadway.
54. The method according to any one of claims 45-47, wherein the sample is obtained from the group consisting of: a human, a plant, an animal, a foodstuff, a body tissue sample, a body fluid sample, a cell culture and a tissue culture.